Identifying an off-CPU Bottleneck

Tests against a new microservice showed that throughput was constrained and showed signs of an off-CPU bottleneck. This post summarises the approach taken to investigate the problem from a Linux perspective and shows how system calls can be analysed to identify where time is being spent.

As strategic observability and tracing toolsets had not yet been implemented, we were given direct access to the hosts, allowing local observability tools to be installed. The service was running on an AWS EKS node (c6g.large) as a single container (500m, 500Mi) with the JVM max heap configured to 85% of the container allocation. A single load generator rated up to 9000 req/sec was used for the testing, connected using a private link to a VPC endpoint service. All data transfers and packet rates were significantly below the limits in the AWS stress testing policy.

Throughput

The table below summarises some of the metrics produced by a series of load tests against two endpoints provided by the microservice. Tests were executed against a GET endpoint followed by tests against a POST endpoint. Mean client pacing was 1.5 ms. The results show that the GET endpoint is constrained at 287 req/sec with the CPU allocation 99% utilised, whereas the POST endpoint is constrained at 90 req/sec with CPU utilisation much lower at 37%, suggesting the POST endpoint bottleneck is off-CPU.

Metric                                   GET Test  POST Test
--------------------------------------  ---------  ---------
Active Clients                                 10         10
Throughput OK (req/sec)                       287         90
Throughput KO (req/sec)                         0          0
Response Time Min. (msec)                       4         22
Response Time Mean (msec)                      32        109
Container Count                                 1          1
Container CPU (% allocation)                   99         37
Container Memory RSS (% allocation)            68         69
Container Network RX (KB/s)                   297        279
Container Network TX (KB/s)                   444        234
Logged Errors                                   0          0
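A rough cross-check of the table supports the off-CPU reading. CPU time per request can be estimated from utilisation and throughput (a back-of-the-envelope sketch, assuming the 500 millicore container limit above): both endpoints cost a similar amount of CPU per request, so the POST endpoint has plenty of CPU headroom and must be waiting on something else.

```python
# Estimate CPU time per request from the load test table above.
# Assumes the 500 millicore (0.5 core) container CPU limit.
CPU_LIMIT_CORES = 0.5

def cpu_ms_per_request(cpu_pct_of_allocation, req_per_sec):
    # CPU seconds consumed per second of test, divided by throughput,
    # gives CPU time per request in milliseconds
    cpu_cores_used = (cpu_pct_of_allocation / 100) * CPU_LIMIT_CORES
    return cpu_cores_used / req_per_sec * 1000

get_ms = cpu_ms_per_request(99, 287)   # GET:  99% of allocation at 287 req/sec
post_ms = cpu_ms_per_request(37, 90)   # POST: 37% of allocation at 90 req/sec
print(round(get_ms, 1), round(post_ms, 1))  # ~1.7 ms vs ~2.1 ms of CPU per request
```

If the POST endpoint were CPU-bound at ~2.1 ms of CPU per request, the 0.5 core allocation could sustain roughly 240 req/sec, well above the 90 req/sec observed.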

Threads

Running ps on the EKS host, we can find the process id of the java application to investigate. An alternative way of doing this using ctr can be found here.

$ ps -afe | grep java
185         4976    2966  0 Jan10 ?        00:47:41 java -Djava.util.logging.manager=org.jboss.logmanager.LogManager -jar -quarkus-run.jar
root     4165123 4164704  0 07:56 pts/0    00:00:00 grep --color=auto java

Using top we can see details about the threads the process is running and their CPU time. The -d flag sets the delay between updates and -n the number of updates; the -b flag runs top in batch mode so the output can be redirected to a file for offline analysis.

$ top -H -p 4976 -d 300 -n 3 -b > top-output.txt

The output below shows a 5 minute summary when the system was under maximum load during a GET test. %CPU shows each task's share of CPU time since the last update. The java process is using almost all of the CPU allocation for the container (the CPU limit is 500 millicores, 25% of the host CPU) and the most active threads are the executor-thread, VM Thread and vert.x-eventloop threads. The executor and event loop threads are processing application requests and the VM Thread is most likely performing garbage collection. There are 11 executor threads running, approximately one per active client. NOTE: only threads with %CPU greater than 0 are listed.

top - 11:05:12 up 18:00,  0 users,  load average: 1.85, 1.81, 1.49
Threads:  34 total,  10 running,  24 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.1 us,  3.2 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.0 hi,  1.1 si,  0.0 st
KiB Mem :  3908232 total,   170604 free,  1408528 used,  2329100 buff/cache
KiB Swap:        0 total,        0 free,        0 used,  2322292 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 341304 185       20   0 2779060 249384  19832 R  4.0  6.4   0:26.13 executor-thread
 331916 185       20   0 2779060 249384  19832 R  3.9  6.4   0:50.30 executor-thread
 342982 185       20   0 2779060 249384  19832 S  3.9  6.4   0:21.12 executor-thread
 331645 185       20   0 2779060 249384  19832 S  3.8  6.4   0:49.09 executor-thread
 342254 185       20   0 2779060 249384  19832 R  3.8  6.4   0:23.33 executor-thread
 345262 185       20   0 2779060 249384  19832 S  3.8  6.4   0:14.70 executor-thread
 342255 185       20   0 2779060 249384  19832 R  3.7  6.4   0:23.00 executor-thread
 338001 185       20   0 2779060 249384  19832 R  3.6  6.4   0:33.54 executor-thread
 331222 185       20   0 2779060 249384  19832 R  3.3  6.4   0:46.87 executor-thread
 330952 185       20   0 2779060 249384  19832 R  2.9  6.4   0:47.67 executor-thread
   4683 185       20   0 2779060 249384  19832 R  2.8  6.4   0:38.15 VM Thread
 347967 185       20   0 2779060 249384  19832 R  2.2  6.4   0:06.50 executor-thread
   4773 185       20   0 2779060 249384  19832 R  1.7  6.4   0:34.86 vert.x-eventloo
   4772 185       20   0 2779060 249384  19832 R  1.7  6.4   0:34.09 vert.x-eventloo
 349398 185       20   0 2779060 249384  19832 R  1.1  6.4   0:03.31 executor-thread
   4692 185       20   0 2779060 249384  19832 R  0.3  6.4   0:04.12 Notification Th

The output below shows a 5 minute summary when the system was under maximum load during a POST test. The snapshot shows a different profile to the GET test. As before, the event loop, executor and VM threads are the most active of the set; however, the executor threads are on CPU for significantly less time, while one of the vert.x-eventloop threads is on CPU for significantly more. The total thread count is similar, but the thread ids of all the executor threads are different, indicating they are created on demand. When the snapshot was taken all threads were in an interruptible sleep state.

top - 12:15:37 up 19:10,  0 users,  load average: 0.19, 0.30, 0.68
Threads:  36 total,   0 running,  36 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9.9 us,  3.2 sy,  0.0 ni, 85.2 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
KiB Mem :  3908232 total,   273392 free,  1275780 used,  2359060 buff/cache
KiB Swap:        0 total,        0 free,        0 used,  2455464 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   4772 185       20   0 2783732 303640  19832 S  3.6  7.8   1:34.86 vert.x-eventloo
   4773 185       20   0 2783732 303640  19832 S  1.7  7.8   1:34.09 vert.x-eventloo
 401353 185       20   0 2783732 303640  19832 S  1.3  7.8   0:21.12 executor-thread
 401762 185       20   0 2783732 303640  19832 S  1.3  7.8   0:49.09 executor-thread
 404179 185       20   0 2783732 303640  19832 S  1.3  7.8   0:23.33 executor-thread
 402481 185       20   0 2783732 303640  19832 S  1.2  7.8   0:14.70 executor-thread
 403638 185       20   0 2783732 303640  19832 S  1.2  7.8   0:23.00 executor-thread
 405708 185       20   0 2783732 303640  19832 S  1.1  7.8   0:33.54 executor-thread
 403048 185       20   0 2783732 303640  19832 S  1.1  7.8   0:46.87 executor-thread
 401343 185       20   0 2783732 303640  19832 S  1.1  7.8   0:47.67 executor-thread
 406398 185       20   0 2783732 303640  19832 S  1.0  7.8   0:38.15 executor-thread
 402916 185       20   0 2783732 303640  19832 S  0.8  7.8   0:06.50 executor-thread
 411472 185       20   0 2783732 303640  19832 S  0.4  7.8   0:34.86 executor-thread
   4683 185       20   0 2783732 303640  19832 S  0.3  7.8   1:34.09 VM Thread
   4689 185       20   0 2783732 303640  19832 S  0.1  7.8   0:03.31 C2 CompilerThre
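When top is captured in batch mode, the per-thread %CPU figures can be aggregated by command name to compare thread pools across snapshots. A minimal sketch (the sample lines are copied from the POST snapshot above; a real script would read the top -b output file):

```python
from collections import defaultdict

# A few lines of `top -H -b` output from the POST snapshot above
sample = """\
   4772 185       20   0 2783732 303640  19832 S  3.6  7.8   1:34.86 vert.x-eventloo
   4773 185       20   0 2783732 303640  19832 S  1.7  7.8   1:34.09 vert.x-eventloo
 401353 185       20   0 2783732 303640  19832 S  1.3  7.8   0:21.12 executor-thread
 401762 185       20   0 2783732 303640  19832 S  1.3  7.8   0:49.09 executor-thread
"""

cpu_by_command = defaultdict(float)
for line in sample.splitlines():
    fields = line.split()
    # Field 8 is %CPU, field 11 is COMMAND in the thread listing above
    cpu_by_command[fields[11]] += float(fields[8])

print(dict(cpu_by_command))
```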

The output from top shows that the process is running a set of threads for the Vert.x reactive application toolkit. Vert.x uses an event-based architecture that runs event loop threads for non-blocking event dispatching and executor threads for longer-running blocking requests. It also runs an acceptor thread for handling client connections on a socket and a blocked thread checker for monitoring the event loop and executor threads. The event loop thread count defaults to 2 * cores, hence two threads for the above process. The maximum execute time before a thread is flagged as blocked is 2 seconds for event loop threads and 60 seconds for executor threads; see VertxOptions for more details. Vert.x provides a set of modules for building reactive applications and uses Netty for the NIO event loop. Top shows that executor threads have been created, which suggests they are handling long-running or blocking tasks. See the Quarkus dispatching strategy for further information.
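The same event-loop/executor dispatching pattern can be illustrated outside Java. The Python sketch below (purely illustrative, not Vert.x code) uses asyncio in the same role: the event loop dispatches non-blocking work, and blocking calls are handed to a worker pool so the loop is never stalled.

```python
import asyncio
import time

def blocking_call(i):
    # Stand-in for a blocking operation (e.g. a synchronous database query)
    time.sleep(0.02)
    return f"result-{i}"

async def handle_request(i):
    loop = asyncio.get_running_loop()
    # Hand the blocking work to a worker thread (the Vert.x analogue is an
    # executor thread) so the event loop stays free to dispatch other events
    return await loop.run_in_executor(None, blocking_call, i)

async def main():
    # Four concurrent "requests" overlap their waits in the worker pool
    # instead of serialising the event loop
    return await asyncio.gather(*(handle_request(i) for i in range(4)))

results = asyncio.run(main())
print(results)
```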

System Calls

Output from top shows that the executor threads are on CPU a lot less in the POST test. When a thread goes off-CPU it can be waiting for events such as network I/O, disk I/O, scheduling, lock acquisition, or work to process. These types of event involve system calls to the kernel. Using perf trace we can obtain a summary of the system calls the executor and event loop threads are making to find out more about their activity. The command below was run for 10 seconds.

$ perf trace -s -p 4976 -o perf-trace-output.txt

The output below shows system calls made by one of the executor threads during a POST endpoint test at maximum load. It shows the executor thread is spending most of its time waiting for I/O on file descriptors using ppoll, with significant differences between the average and maximum values reported. Errors are also shown for the read and futex calls. The summary shown is representative of the rest of the executor threads.

Summary of events:

executor-thread (496150), 19974 events, 7.3%

  syscall            calls errors  total       min       avg       max   stddev
                                   (msec)    (msec)    (msec)    (msec)    (%)
  --------------- -------- ------ -------- --------- --------- --------- ------
  ppoll               1465     0 20360.553     0.000    13.898   389.397  7.48%
  futex               1768    45  2294.950     0.002     1.298    23.692  4.45%
  write               2316     0    55.680     0.003     0.024     0.158  1.04%
  read                4437  1465    43.308     0.002     0.010     5.931 13.73%
  gettid                 1     0     0.010     0.010     0.010     0.010  0.00%

The output below shows the top 5 system calls for the event loop threads during the same test in the same sample window. It shows the threads are spending most of their time waiting for I/O on file descriptors using epoll_pwait, and that the event loop threads are also writing to sockets. There are some errors for the futex calls, as well as significant differences between the average and maximum values reported for epoll_pwait. One of the event loop threads processed more events than the other and logged 1027 futex errors.

Summary of events:

vert.x-eventloo (5055), 40318 events, 14.8%

  syscall            calls errors  total       min       avg       max   stddev
                                   (msec)    (msec)    (msec)    (msec)    (%)
  --------------- -------- ------ -------- --------- --------- --------- ------
  epoll_pwait         5609     0 20214.774     0.000     3.604    97.388  2.75%
  futex               5373   145    95.882     0.002     0.018     4.787 12.99%
  read                5623     0    51.278     0.002     0.009     1.082  2.62%
  write               2408     0    41.232     0.003     0.017     0.298  1.63%
  writev               835     0    24.772     0.011     0.030     0.251  1.50%


vert.x-eventloo (5056), 54052 events, 19.8%

  syscall            calls errors  total       min       avg       max   stddev
                                   (msec)    (msec)    (msec)    (msec)    (%)
  --------------- -------- ------ -------- --------- --------- --------- ------
  epoll_pwait         7658     0 19652.526     0.000     2.566    85.186  2.06%
  futex               7310  1027   146.670     0.002     0.020     6.561  6.29%
  read                7760     0    75.407     0.002     0.010     0.394  1.28%
  write               2825     0    60.900     0.003     0.022     2.496  4.25%
  writev              1008     0    31.529     0.012     0.031     0.395  1.49%

Whilst some of the system calls made by each thread type have high maximum values, none of them show an average timing that equates to the 109 ms round trip time measured in the test. They do, however, indicate that a lot of time is spent in file descriptor I/O.

System Trace

Using strace we can identify the sequence of system calls and arguments being passed when processing a request. POST requests were executed sequentially and traced with no other activity on the system. strace was used in this scenario because by default it shows the request data being passed, which makes it easier to follow the requests. Similar information can be obtained using perf, which is less intrusive; caution should be used when running strace on a live environment. When executing strace under maximum load a 25% reduction in throughput was observed.

$ strace -rttTf -p 3923004 -o strace-output.txt

The output of strace can be filtered to find the event loop thread and executor thread that processed the client request. To focus on file descriptor I/O, the futex calls have been omitted and for ease of reading some of the read and write data has been replaced with “<data>”. An ongoing system call is marked as unfinished and resumed by strace if another system call is made from a different process/thread.

5069  11:49:11.129865 (+    0.004709) read(17, "POST /servicea HTTP/1."..., 288) = 274 <0.000026>
5069  11:49:11.130072 (+    0.000013) epoll_pwait(156, <unfinished...>
1613992  11:49:11.130144 (+    0.000061) write(157, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000052>
5069  11:49:11.130218 (+    0.000076) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 613, NULL, 8) = 1 <0.000139>
5069  11:49:11.130247 (+    0.000012) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000038>
5069  11:49:11.130408 (+    0.000009) epoll_pwait(156, <unfinished...>
1613992  11:49:11.130476 (+    0.000024) write(157, "\1\0\0\0\0\0\0\0", 8) = 8 <0.000025>
5069  11:49:11.130569 (+    0.000025) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 612, NULL, 8) = 1 <0.000153>
5069  11:49:11.130599 (+    0.000028) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000024>
1613992  11:49:11.130944 (+    0.000197) write(157, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
5069  11:49:11.130971 (+    0.000023) epoll_pwait(156, <unfinished...>
1613992  11:49:11.131005 (+    0.000035) <... write resumed> ) = 8 <0.000043>
5069  11:49:11.131019 (+    0.000011) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 612, NULL, 8) = 1 <0.000034>
5069  11:49:11.131053 (+    0.000012) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000014>
5069  11:49:11.131123 (+    0.000072) write(20, "GET /serviceb/123"..., 232) = 232 <0.000056>
5069  11:49:11.131229 (+    0.000107) epoll_pwait(156, [{EPOLLIN, {u32=20, u64=281470681743380}}], 1024, 612, NULL, 8) = 1 <0.003947>
5069  11:49:11.135213 (+    0.003983) read(20, "HTTP/1.1 200 \r\nVary: Origin\r\n"..., 16384) = 2240 <0.000030>
1613992  11:49:11.135756 (+    0.000164) write(27, "<data>"..., 144 <unfinished ...>
5069  11:49:11.135789 (+    0.000029) epoll_pwait(156, <unfinished...>
1613992  11:49:11.135819 (+    0.000030) <... write resumed> ) = 144 <0.000039>
1613992  11:49:11.135840 (+    0.000020) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000027>
1613992  11:49:11.135899 (+    0.000062) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001360>
1613992  11:49:11.137309 (+    0.001408) read(27, "<data>", 5) = 5 <0.000023>
1613992  11:49:11.137363 (+    0.000054) read(27, "<data>"..., 208) = 208 <0.000030>
1613992  11:49:11.137529 (+    0.000167) write(27, "<data>"..., 136) = 136 <0.000050>
1613992  11:49:11.137615 (+    0.000083) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
1613992  11:49:11.137674 (+    0.000058) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001264>
1613992  11:49:11.138981 (+    0.001307) read(27, "<data>", 5) = 5 <0.000022>
1613992  11:49:11.139030 (+    0.000049) read(27, "<data>"..., 49) = 49 <0.000028>
1613992  11:49:11.139130 (+    0.000101) write(27, "<data>"..., 140) = 140 <0.000059>
1613992  11:49:11.139223 (+    0.000091) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000027>
1613992  11:49:11.139280 (+    0.000056) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001193>
1613992  11:49:11.140523 (+    0.001243) read(27, "<data>", 5) = 5 <0.000023>
1613992  11:49:11.140576 (+    0.000054) read(27, "<data>"..., 481) = 481 <0.000031>
1613992  11:49:11.140755 (+    0.000180) write(27, "<data>"..., 148) = 148 <0.000049>
1613992  11:49:11.140839 (+    0.000081) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000029>
1613992  11:49:11.140898 (+    0.000058) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.002336>
1613992  11:49:11.143277 (+    0.002379) read(27, "<data>", 5) = 5 <0.000023>
1613992  11:49:11.143327 (+    0.000049) read(27, "<data>"..., 49) = 49 <0.000029>
1613992  11:49:11.143436 (+    0.000110) write(27, "<data>"..., 62) = 62 <0.000053>
1613992  11:49:11.143537 (+    0.000100) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000023>
1613992  11:49:11.143588 (+    0.000051) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001193>
1613992  11:49:11.144825 (+    0.001236) read(27, "<data>", 5) = 5 <0.000023>
1613992  11:49:11.144878 (+    0.000053) read(27, "<data>"..., 68) = 68 <0.000028>
1613992  11:49:11.145084 (+    0.000207) write(27, "<data>"..., 405) = 405 <0.000050>
1613992  11:49:11.145171 (+    0.000084) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000030>
1613992  11:49:11.145231 (+    0.000059) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.001333>
1613992  11:49:11.146607 (+    0.001376) read(27, "<data>", 5) = 5 <0.000022>
1613992  11:49:11.146657 (+    0.000049) read(27, "<data>"..., 51) = 51 <0.000028>
1613992  11:49:11.146806 (+    0.000150) write(27, "<data>"..., 60) = 60 <0.000054>
1613992  11:49:11.146894 (+    0.000086) read(27, 0xffff7ca3b5a0, 5) = -1 EAGAIN (Resource temporarily unavailable) <0.000031>
1613992  11:49:11.146954 (+    0.000060) ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 ([{fd=27, revents=POLLIN}]) <0.003337>
1613992  11:49:11.150342 (+    0.003388) read(27, "<data>", 5) = 5 <0.000025>
1613992  11:49:11.150400 (+    0.000058) read(27, "<data>"..., 47) = 47 <0.000030>
1613992  11:49:11.150573 (+    0.000172) write(1, "<data>"..., 196) = 196 <0.000073>
1613992  11:49:11.150887 (+    0.000313) write(157, "\1\0\0\0\0\0\0\0"..., 8) = 8 <0.000048>
5069  11:49:11.150965 (+    0.000075) <... epoll_pwait resumed> [{EPOLLIN, {u32=157, u64=281470681743517}}], 1024, 607, NULL, 8) = 1 <0.015166>
5069  11:49:11.150996 (+    0.000012) read(157, "\1\0\0\0\0\0\0\0", 16) = 8 <0.000020>
5069  11:49:11.151083 (+    0.000094) writev(17, [{iov_base="HTTP/1.1 201 Created\r\nContent-Ty"..., iov_len=254}, {iov_base="<data>"
5069  11:49:11.151217 (+    0.006079) epoll_pwait(156, [{EPOLLIN, {u32=17, u64=281470681743377}}], 1024, 592, NULL, 8) = 1 <0.006079>
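The per-call durations strace appends in angle brackets (the -T flag) can be aggregated by file descriptor to see where the time goes. A minimal sketch (the trace lines are simplified copies from above; the regex assumes the line format shown and picks up either the first numeric argument or an fd=N inside a ppoll argument list):

```python
import re
from collections import defaultdict

# Simplified lines from the strace output above
trace = """\
1613992  ppoll([{fd=27, events=POLLIN}], 1, NULL, NULL, 0) = 1 <0.001360>
1613992  read(27, "<data>", 5) = 5 <0.000023>
1613992  write(27, "<data>", 136) = 136 <0.000050>
5069  write(20, "GET /serviceb/123", 232) = 232 <0.000056>
"""

# Capture syscall name, file descriptor, and the -T duration at end of line
call = re.compile(r'(\w+)\((?:\[\{fd=)?(\d+).*<([\d.]+)>$')

time_by_fd = defaultdict(float)
for line in trace.splitlines():
    m = call.search(line)
    if m:
        time_by_fd[int(m.group(2))] += float(m.group(3))

print(dict(time_by_fd))  # seconds spent in syscalls per descriptor
```

Run over the full trace of one request, this gives the per-descriptor breakdown used in the following sections.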

File Descriptors

The trace above shows that the event loop thread and executor threads are reading and writing to multiple file descriptors. Using lsof we can find out what type they are. The event loop thread is using eventpoll 156 for I/O notification on multiple descriptors, eventfd 157 for application wait events and descriptors 17 and 20 for network I/O.

$ lsof | grep -e 5069 | grep -e " 17[urw] " -e " 20[urw] " -e " 156[urw] " -e " 157[urw] "
vert.x-ev   4976 5069       185   17u     sock                0,8       0t0    9005252 protocol: TCPv6
vert.x-ev   4976 5069       185   20u     sock                0,8       0t0    9004984 protocol: TCPv6
vert.x-ev   4976 5069       185  156u  a_inode               0,13         0      11657 [eventpoll]
vert.x-ev   4976 5069       185  157u  a_inode               0,13         0      11657 [eventfd]

The executor thread is using eventfd 157 for application notification events, descriptor 1 for stdout logging and descriptor 27 for network I/O.

$ lsof | grep -e 1613992 | grep -e " 1[urw] " -e " 27[urw] " -e " 157[urw] "
executor-   4976 1613992    185    1w     FIFO               0,12       0t0      34511 pipe
executor-   4976 1613992    185   27u     sock                0,8       0t0    8978747 protocol: TCPv6
executor-   4976 1613992    185  157u  a_inode               0,13         0      11657 [eventfd]

Using nsenter to run ss in the namespace of the process we can find out what the socket connections are. Descriptor 17 is the connection with the requesting client, descriptor 20 is a connection to an external service on port 8090, and descriptor 27 is a connection to a postgres database.

$ nsenter -t 4976 -n ss -i -p | grep -e "fd=17" -e "fd=20" -e "fd=27"
tcp   ESTAB 0      0      [::ffff:xx.x.xxx.xxx]:37636    [::ffff:xx.x.xxx.xxx]:postgres users:(("java",pid=4976,fd=27))
tcp   ESTAB 0      1578   [::ffff:xx.x.xxx.xxx]:webcache [::ffff:xx.x.xxx.xxx]:16828 users:(("java",pid=4976,fd=17))
tcp   ESTAB 0      0      [::ffff:xx.x.xxx.xxx]:60508    [::ffff:xx.x.xxx.xxx]:8090 users:(("java",pid=4976,fd=20))

Call Sequence

Combining the trace output with the file descriptor information we can see that the event loop thread receives the client request on a TCP socket (fd=17) and returns the final response to the client on the same socket. The entire request took 21.2 ms to process in an uncontended scenario. The event loop thread makes a call to service B (fd=20), which took 4.1 ms, and when the response is received the executor thread makes 7 calls to a postgres database (fd=27) totalling 15.2 ms before the response is returned to the client. Each initial database read fails with EAGAIN as no data has yet been returned to read, which is the source of the executor thread read errors in the perf summary above. ppoll is then used to wait for data to read, with the timeout specified as NULL, which blocks indefinitely. The event loop thread uses epoll for I/O notification and monitors the eventpoll descriptor (fd=156) for socket I/O from the client request (fd=17) and the service B request (fd=20), as well as event notifications from the executor thread on the eventfd (fd=157). The writev system call is used to return the response back to the client (fd=17).

Database Latency

In the trace above a single request made 7 round trips to the database. Referring back to the summary report from perf trace, the ratio of ppoll system calls to writev system calls across all threads is 6.9, which approximately aligns with the single request trace. With an average ppoll latency of 14.51 ms, each client request spent around 101.6 ms (7 x 14.51 ms) waiting for a response from the database. This accounts for 93% of the 109 ms response time observed under load. The remaining 7% comprises the call to service B (4 ms) as well as latency between the test tools and the microservice.

executor-thread
 ppoll               1465      0 20360.553     0.000    13.898   389.397      7.48%
 ppoll               1460      0 20091.517     0.000    13.761   336.062      6.69%
 ppoll               1440      0 20553.106     0.000    14.273   302.391      6.73%
 ppoll               1462      0 20236.413     0.000    13.842   384.526      7.00%
 ppoll               1341      0 20813.370     0.000    15.521   316.937      6.91%
 ppoll               1366      0 20612.547     0.000    15.090   444.624      7.41%
 ppoll               1441      0 20386.041     0.000    14.147   383.300      6.99%
 ppoll               1421      0 20578.373     0.000    14.482   412.241      7.25%
 ppoll               1315      0 20852.107     0.000    15.857   396.901      7.34%
vert.x-eventloop
 writev               835      0    24.772     0.011     0.030     0.251      1.50%
 writev              1008      0    31.529     0.012     0.031     0.395      1.49%
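The arithmetic behind those figures can be reproduced directly from the excerpt (a sketch; the calls and total columns are copied from the perf trace report above):

```python
# (calls, total msec) for ppoll across the nine executor threads listed above
ppoll = [(1465, 20360.553), (1460, 20091.517), (1440, 20553.106),
         (1462, 20236.413), (1341, 20813.370), (1366, 20612.547),
         (1441, 20386.041), (1421, 20578.373), (1315, 20852.107)]
# writev calls across the two event loop threads (one writev per client response)
writev_calls = 835 + 1008

ppoll_calls = sum(calls for calls, _ in ppoll)
ratio = ppoll_calls / writev_calls                      # DB round trips per request
avg_ppoll_ms = sum(t for _, t in ppoll) / ppoll_calls   # mean wait per round trip
wait_per_request_ms = 7 * avg_ppoll_ms                  # DB wait per client request

print(round(ratio, 1), round(avg_ppoll_ms, 2), round(wait_per_request_ms, 1))
```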

Blocking Calls

The output from strace shows that the call to service B was made from the event loop thread. Although blocked thread warnings were not reported by Vert.x during testing (as the round trip time is less than the 2000 ms threshold), we recommended that this be reviewed. Event loop threads run sequentially and are designed to run one thread per processor to minimise context switching overhead. For this model to work, blocking calls must not be made from event loop threads (see the Vert.x golden rule). The development team subsequently confirmed that the REST client getById method is being used, which Quarkus recommends should not be invoked from the event loop thread; the asynchronous variant of the call should be used instead.

Conclusion

Using the tools above we have been able to identify that:

  • Threads are waiting off-CPU because requests to the database slow down under load. The amount of time spent waiting for a response from the database begins to increase when the database reaches a limit, compounded by the fact that there are 7 database calls per API request.
  • Blocking calls made from the event loop threads present a further potential bottleneck within Vert.x, impacting both GET and POST requests if service B slows down. Furthermore, if the calls to service B are made non-blocking, service B may then become a bottleneck on POST request throughput.

This gives the following questions for further investigation:

  1. Why are the database calls slowing down under load?
  2. What diagnostic information can we get from the database?
  3. Can we monitor network latency to the database?
  4. What level of throughput does the POST endpoint need to support?
  5. What type of database workload does the POST endpoint generate? Is this a write-heavy workload?
  6. Are there any options to reduce the visit count?
  7. Can the blocking call made to service B be replaced with a non-blocking asynchronous version?
  8. What happens to service B if throughput is increased? Does service B become a bottleneck?

Other points to note were:

  • Some of the tools used create a snapshot of system activity at a particular point in time. It was useful to combine these into a script so that output from multiple tools is captured at the same point and sampling can be repeated.
  • The summary report produced by perf gives summary statistics for each system call. If some of the system calls perform I/O on different file descriptor types, or show a lot of tail latency, it can be useful to view the latency distribution using funclatency from BCC tools. Other tools in that repository, such as offwaketime, are also useful for tracking off-CPU time.