Issue
We have three Kafka brokers running version 1.1.1 on c5.4xlarge AWS instances, with about 750 topics and about 1,800 partitions. For certain reasons we cannot upgrade soon. One team connected to these brokers has mentioned that the time to receive an ACK (with acks=1) occasionally exceeds 10 seconds, even though the p95 ACK response time is about 11 ms. I'm a bit confused as to what could be causing this.
We are running the following version of Java:
openjdk version "1.8.0_382"
OpenJDK Runtime Environment (build 1.8.0_382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)
You can see the delay on the producer side in these logs. This is a test producer that sends messages synchronously, waiting on the future returned by send() with .get(), via the Spring Kafka client.
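For reference, the test producer boils down to something like the following (a minimal sketch assuming a Spring-managed KafkaTemplate<String, String>; the class, method, and topic names here are illustrative, not our exact code):

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.stereotype.Service;

@Service
public class TestProducerService {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public TestProducerService(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String key, String value) throws Exception {
        long start = System.currentTimeMillis();
        // send() is asynchronous; get() blocks until the broker ACKs (acks=1 here),
        // so the elapsed time measures the full round trip for each message.
        SendResult<String, String> result =
                kafkaTemplate.send("test-producer", key, value).get();
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("Publish completed in %dms, Metadata %s%n",
                elapsed, result.getRecordMetadata());
    }
}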
2023-12-05 19:27:15.036 TRACE 1664 --- [ | TestProducer] o.a.k.clients.producer.internals.Sender : [Producer clientId=TestProducer] Nodes with data ready to send: [kafka1:6667 (id: 1001 rack: us-west-2a)]
2023-12-05 19:27:15.036 TRACE 1664 --- [ | TestProducer] o.a.k.clients.producer.internals.Sender : [Producer clientId=TestProducer] Sent produce request to 1001: (type=ProduceRequest, acks=1, timeout=30000, partitionRecords=({test-producer-2=[(record=DefaultRecord(offset=0, timestamp=1701804435036, key=0 bytes, value=3 bytes))]}), transactionalId=''
2023-12-05 19:27:25.062 TRACE 1664 --- [ | TestProducer] o.a.k.clients.producer.internals.Sender : [Producer clientId=TestProducer] Received produce response from node 1001 with correlation id 59645
2023-12-05 19:27:25.062 TRACE 1664 --- [ | TestProducer] o.a.k.c.p.internals.ProducerBatch : Successfully produced messages to test-producer-2 with base offset 283532.
2023-12-05 19:27:25.062 INFO 1664 --- [ scheduling-1] com.company.kafkatest.Service : Publish completed in 10025ms, Start: 2023-12-05T19:27:15.036Z, End: 2023-12-05T19:27:25.062Z. Metadata test-producer-2@283532
On the broker side, we can see the corresponding request log entry, which was recorded at 2023-12-05 13:27:25.062:
Completed request:RequestHeader(apiKey=PRODUCE, apiVersion=5, clientId=TestProducer, correlationId=59645) -- {acks=1,timeout=30000,numPartitions=1},response:{responses=[{topic=produce-test,partition_responses=[{partition=2,error_code=0,base_offset=283532,log_append_time=-1,log_start_offset=91996}]}],throttle_time_ms=0} from connection 10.0.0.13:6667-[redacted broker IP]:37072-52576;totalTime:15.921,requestQueueTime:0.025,localTime:0.51,remoteTime:0.0,throttleTime:0.017,responseQueueTime:15.174,sendTime:0.213,securityProtocol:SASL_OAUTHBEARER,principal:User:[redacted],listener:SASL_OAUTHBEARER (kafka.request.logger)
Here are some of the non-default configs we have on the broker side:
num.partitions=3
num.recovery.threads.per.data.dir=32
# replica timings
max.incremental.fetch.session.cache.slots=20000
replica.lag.time.max.ms=90000
num.replica.fetchers=8
num.io.threads=16
num.network.threads=32
heartbeat.interval.ms=9000
session.timeout.ms=30000
The producer, using client library version 1.1.1, is using all the defaults with acks=1. Here is the producer config that is printed in the logs:
2023-12-05 19:24:58.971 INFO 1664 --- [ scheduling-1] o.a.k.clients.producer.ProducerConfig : ProducerConfig values:
acks = 1
batch.size = 16384
bootstrap.servers = [kafka1:6667, kafka2:6667, kafka3:6667]
buffer.memory = 33554432
client.id = TestProducer
compression.type = none
connections.max.idle.ms = 540000
enable.idempotence = false
interceptor.classes = []
key.serializer = class org.apache.kafka.common.serialization.StringSerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 0
retry.backoff.ms = 100
sasl.jaas.config = [hidden]
[removed some sasl configs]
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = OAUTHBEARER
security.protocol = SASL_OAUTHBEARER
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = null
value.serializer = class org.apache.kafka.common.serialization.StringSerializer
Also, for the JVM options, we are using:
KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -Djava.awt.headless=true"
Note that the CPU usage on these brokers usually sits around 3%, and the load is almost non-existent.
Solution
After spending quite a lot of time debugging this issue, it appears the delay was caused by a DNS lookup timeout. Kafka 1.1.1 performs a hostname (reverse DNS) lookup for every new connection. Clients in certain data centers were connecting from addresses that have no DNS entries, so the lookup would time out and delay every other connection handled by the same thread. This would also explain why the broker's request log shows a totalTime of only about 16 ms: the blocking happens before the request itself is read and timed.
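As a rough way to see this effect in isolation, a plain-Java probe along these lines (the IP below is just a placeholder, not one of our client addresses) measures how long a reverse lookup blocks for an address with no PTR record:

import java.net.InetAddress;

public class ReverseLookupProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder: substitute a client IP that has no reverse DNS (PTR) entry.
        InetAddress addr = InetAddress.getByName("192.0.2.10");
        long start = System.currentTimeMillis();
        // getCanonicalHostName() triggers the reverse lookup; with no PTR record
        // it blocks until the OS resolver gives up and then returns the IP itself.
        String host = addr.getCanonicalHostName();
        System.out.printf("Resolved to %s in %dms%n",
                host, System.currentTimeMillis() - start);
    }
}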
Our solution was to reduce the DNS lookup timeout to 1 second and to set the Java security property networkaddress.cache.negative.ttl so that failed lookups are cached. See the following issue for more details: https://issues.apache.org/jira/browse/KAFKA-8562
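For reference, the workaround can be applied roughly like this on the broker hosts; the 1-second resolver timeout matches what we used, while the negative-TTL value below is only illustrative (the JVM default is 10 seconds):

# /etc/resolv.conf: fail fast instead of waiting the default 5 s per lookup attempt
options timeout:1

# $JAVA_HOME/jre/lib/security/java.security (or Security.setProperty at startup):
# cache failed hostname lookups so repeated misses don't block again immediately
networkaddress.cache.negative.ttl=30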
This will work for us until we can update Kafka to a newer version. I understand that newer versions of Kafka do not have this issue, as the DNS lookup was removed.
Answered By - jsmaupin