Data for a new Java microservice showed significantly lower throughput for HTTP POST requests than for GET requests. This post summarises an approach to identifying the off-CPU bottleneck responsible, using some powerful and freely available Linux tools: thread activity was traced for the service under load and mapped back to the application architecture and threading model.
Application
The diagram below outlines the primary interactions of the service under test (Service A). Service A is a new microservice built with Quarkus for running in a containerised environment; it exposes an HTTP API, calls Service B over HTTP, and reads and writes a PostgreSQL database over JDBC.

Environment
Dedicated test environments were created on AWS to test the service. Service A was hosted on a single-node (c6g.large) AWS EKS cluster with sufficient capacity to run two 500m CPU / 500Mi memory containers per node. The service was set to auto-scale containers at 80% CPU, and the JVM max heap was configured to 85% of the container memory allocation. Fully working and tested instances of Service B and the database were provisioned on their own dedicated hosting and seeded with data of the same size, shape and volume as production. No caching was implemented for reads against Service B or the database. Test tools were connected using an AWS VPC PrivateLink endpoint to an ELB. A single load generator rated at up to 9000 req./sec was used for the testing. All data transfers and packet rates were significantly less than the AWS stress-testing policy allows.
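For reference, one common way to achieve that heap setting (an assumption about how it might be configured, not taken from the service itself) is the JVM's MaxRAMPercentage option:
# illustrative only: cap the JVM heap at 85% of the container memory limit
$ java -XX:MaxRAMPercentage=85.0 -jar quarkus-run.jar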
Throughput
The table below summarises some of the metrics produced by a series of stress tests against Service A. The results show that throughput and CPU utilisation are much lower for the POST test than for the GET test, which suggests that the application is spending time waiting for an event. The results also show that the POST tests do not trigger container auto-scaling because of the lower CPU utilisation, and that they take longer to complete, with significant queuing at maximum throughput.
Metric                                              GET            POST
                                                    Stress Test    Stress Test
--------------------------------------------------  -------------  -------------
Response Time Min. (ms)                             4              12
Max. System Throughput (Req/Sec)                    594            78
Response Time Mean @ Max. System Throughput (ms)    86             299
Container Count @ Max. System Throughput            2              1
Container Max. Throughput (Ops/sec)                 303            78
Container CPU @ Max. Throughput (%)                 99             53
JVM Heap @ Max. Throughput (MB)                     121            85
Application Errors                                  0              0
Bottleneck                                          on-CPU         off-CPU
Threads
To investigate why throughput and CPU are lower during the POST test, the tools below were used to obtain more information while the system was under load. As some of the tools create a snapshot at a particular point in time, it was useful to combine them into a script for sampling and offline analysis. EKS is running containerd, so we can get the process id of the container we are interested in using ctr, as below.
$ for i in `ctr --namespace k8s.io containers ls | grep <name> | cut -d' ' -f1`;
do
  ctr --namespace k8s.io tasks ls | grep $i;
done
a087b8b5e433f80114dc5104f818d1a183af48bdb23774f0e28f9ee4b43ab5ae    1234567    RUNNING
Using ps [ref. 1] we can see details of the threads the process is running and their command names. The output below shows that the application is running a set of threads for the memory management and housekeeping functions of the JVM, a set of threads for the Vert.x reactive application toolkit, and a set of threads that are part of the Java executor pool and JDBC connectivity.
$ ps -T -p 1234567 -o stat,comm --no-header | sort | uniq -c | sort -n -r
51 Ssl executor-thread
2 Ssl vert.x-internal
2 Ssl vert.x-eventloo
2 Ssl Transaction
2 Ssl java
1 Ssl VM
1 Ssl VM
1 Ssl vertx-blocked-t
1 Ssl vert.x-acceptor
1 Ssl Sweeper
1 Ssl SmallRye
1 Ssl Signal
1 Ssl Service
1 Ssl Reference
1 Ssl PostgreSQL-JDBC
1 Ssl Notification
1 Ssl Monitor
1 Ssl Finalizer
1 Ssl Common-Cleaner
1 Ssl C2
1 Ssl C1
1 Ssl agroal-11
Vert.x [ref. 2] is an event-based toolkit that uses event-loop threads for non-blocking event dispatching and executor threads for longer-running, blocking requests. It provides a set of modules for building reactive applications and uses Netty for the NIO event loop. We can see from the output above that 51 executor (worker) threads have been created, which suggests they are handling long-running or blocking tasks. All of the threads except one were in an interruptible sleep state (waiting for an event).
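To see what the sleeping threads are waiting on, the same ps invocation can be extended with the kernel wait channel column (a sketch; the wchan names will vary by kernel version):
# include the wait channel to show which kernel function each sleeping thread is blocked in
$ ps -T -p 1234567 -o stat,wchan:32,comm --no-header | sort | uniq -c | sort -n -r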
System Calls
When a thread goes off-CPU it can be for a number of reasons, but the most common are network I/O, disk I/O, acquiring a lock, or waiting for work. Using perf trace we can obtain a summary of the system calls each thread is making under load to find out more about their activity.
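A per-thread summary like the one tabulated below can be produced with perf trace's summary mode (a sketch; the pid and file name are examples):
# summarise system calls per thread; stop with Ctrl-C at the end of the sampling window
$ perf trace --summary --pid 1234567 -o perf-trace-summary.txt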
The table below shows the top 5 system calls by total time for each thread during a GET stress test and a POST stress test. The data shows that more time was spent waiting for file descriptor I/O during the POST test than during the GET test, and that the event-loop thread is making more read and write calls in the POST test. A key difference is that the POST test makes a call to Service B.
                                           GET                     POST
thread                 syscall            calls   total (msec)    calls   total (msec)
---------------------  ---------------  -------  -------------  -------  -------------
vert.x-acceptor        epoll_pwait           48      28791.242       65      25531.706
                       futex                 14        150.530      132          2.093
                       accept                99         10.856      201          1.652
                       setsockopt           153          9.549      134          0.865
vert.x-eventloop       futex               4112       7257.119     6857         83.480
                       epoll_pwait          405        744.088     7059      19843.399
                       read                3555        584.133     7182         65.832
                       writev              3452        463.772     1051         31.407
                       write                  6          1.022     3543         63.328
executor-thread        futex                980      22990.778      318        437.100
                       ppoll                119        444.694      269      23991.239
                       write                418        173.901      423          9.966
                       read                 531        103.196      814          6.302
We can see the event-loop threads are monitoring descriptors using the epoll_pwait system call and the executor threads are using ppoll. As this is an HTTP service that makes calls to Service B and PostgreSQL, it is likely that the event-loop and executor threads are waiting on network I/O.
Running strace we can obtain further detail about the system calls above and the data being sent. NOTE: in this example strace was attached for a short 10-second window; however, similar information can be obtained using perf, which is also less intrusive.
$ strace -rttTf -p 1234567 -o strace-output.txt
3393265 17:47:03.760375 (+ 0.000200) futex(0xffff4c1b44b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
1246868 17:47:03.760399 (+ 0.000019) <... futex resumed> ) = 1 <0.000243>
1246868 17:47:03.760453 (+ 0.000019) write(22, "GET /serviceb/123" ..., 232 <unfinished ...>
3393613 17:47:03.760524 (+ 0.000070) <... ppoll resumed> ) = 1 ([{fd=69, revents=POLLIN}]) <0.001514>
3393613 17:47:03.762510 (+ 0.000070) ppoll([{fd=69, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=69, revents=POLLIN}]) <0.001717>
1246868 17:47:03.765582 (+ 0.000035) read(22, "HTTP/1.1 200 \r\nVary: Origin\r\nVar" ..., 4096) = 2083 <0.000011>
The output above can be analysed to find the file descriptors the event-loop threads and the executor threads are reading from and writing to.
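As a rough illustration, the read/write activity per file descriptor can be tallied straight from the capture (a sketch; the exact fields depend on the strace options used):
# count read/write-family calls per file descriptor across the whole capture
$ grep -oE '(read|write|writev)\([0-9]+' strace-output.txt | sort | uniq -c | sort -n -r | head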
Using nsenter together with the /proc/<pid>/fd listing, lsof and ss, we can find out what type of file descriptors these are (an lsof invocation is sketched after the ss output below). From lsof and the fd listing we can see that fd=157 and fd=159 are eventfd file descriptors and fd=1 is /dev/null, with the remaining descriptors being sockets. The output of ss shows that the event-loop threads (1246868, 1246869) were writing to and reading from 5 TCP sockets connected to Service B (fd=22, fd=87, fd=125, fd=130, fd=131), and the executor thread (3393288) was writing to and reading from a TCP socket connected to PostgreSQL (one socket per executor thread). The steps are: find the process id of the java process in the container, list its file descriptors, then run ss against the container's network namespace.
# find the process id of the java process in the container (it is pid 1 inside the container's pid namespace)
$ nsenter -t 1234567 -m -p sh -c 'ls -lrt /proc/*/exe'
# list the file descriptors for that process id
$ nsenter -t 1234567 -m -p ls -lrt /proc/1/fd > fd-output.txt
# get socket statistics from inside the container's network namespace
# note: this will need to be run more than once if we want to capture any transient connections
$ nsenter -t 1234567 -n ss -i -p > ss-output.txt.`date +%s`
# get details for the file descriptors used for socket I/O
$ grep -e "fd=125" -e "fd=130" -e "fd=131" -e "fd=132" -e "fd=22" -e "fd=87" -e "fd=43" ss-output.txt.*
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:48862 [::ffff:xxx.xxx.xxx.xxx]:postgres users:(("java",pid=1234567,fd=43))
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:34370 [::ffff:xxx.xxx.xxx.xxx]:8090
users:(("java",pid=1234567,fd=22))
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:34346 [::ffff:xxx.xxx.xxx.xxx]:8090
users:(("java",pid=1234567,fd=87))
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:34349 [::ffff:xxx.xxx.xxx.xxx]:8090
users:(("java",pid=1234567,fd=125))
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:34360 [::ffff:xxx.xxx.xxx.xxx]:8090
users:(("java",pid=1234567,fd=130))
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:34363 [::ffff:xxx.xxx.xxx.xxx]:8090
users:(("java",pid=1234567,fd=131))
tcp ESTAB 0 0 [::ffff:xxx.xxx.xxx.xxx]:34371 [::ffff:xxx.xxx.xxx.xxx]:8090
users:(("java",pid=1234567,fd=132))
Combining this with the output from offwaketime, a BCC tool that records off-CPU stacks together with the stacks of the threads that woke them, we have a better understanding of the data flow and calls being made. A POST request is dispatched to a worker thread by the event-loop thread, and the worker thread makes a call to Service B via the event-loop thread and waits for the response. All worker threads are making requests to Service B via the event-loop threads.
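offwaketime is part of the BCC toolkit; a typical invocation against the same process looks like this (a sketch; the install path, pid and duration are examples):
# record off-CPU stacks paired with the stacks of the threads that woke them,
# folded for flame graph generation, over a 30 second window
$ /usr/share/bcc/tools/offwaketime -f -p 1234567 30 > offwaketime.folded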
The next step was to look more closely at timing and concurrency on the event-loop threads, and at the application code.
Code
Event-loop threads process events sequentially and are designed to run one thread per processor to minimise context-switching overhead. For this model to work, blocking calls must not be made from the event-loop threads (https://vertx.io/docs/vertx-core/java/#golden_rule). Yet that is what appears to be happening here.
The data we obtained at runtime shows that the event-loop threads for Service A are spending a lot of time waiting on network I/O to Service B. Service A is using the RESTEasy client to make its HTTP calls to Service B, which raises two questions:
Why are the event-loop threads making blocking calls to Service B?
Why don't we see warnings from the Vert.x blocked thread checker?
References
[1] Linux /proc man page: https://man7.org/linux/man-pages/man5/proc.5.html
[2] Vert.x architecture: https://vertx.io/introduction-to-vertx-and-reactive/