
A Data Engineer's Interview Guide to Apache Kafka


Introduction: The Modern Data Platform


Apache Kafka is a distributed event streaming platform that has become the central nervous system for modern data architectures. It's more than just a message queue; it's a durable, scalable, and fault-tolerant log of events that can power real-time data pipelines, event-driven microservices, and sophisticated analytics. This guide will walk you through the core components of Kafka, from the fundamental concepts to practical, hands-on examples, using a real-world retail platform as our running case study.


Chapter 1: Architectural Foundations - Cluster, Brokers, and Topics



The Kafka Cluster: The Backbone of the Platform


A Kafka cluster is a distributed system made up of one or more servers, each called a broker. [1] Brokers are the physical machines or instances that store and manage all of the data within your Kafka system. [3] They are responsible for receiving new events from producers and serving events to consumers. [3] To ensure high availability and prevent data loss, a production-grade cluster typically runs a minimum of three brokers so that data and its replicas can be spread across them. [1]


The First Architectural Decision: Dedicated vs. Shared Clusters


When designing your Kafka architecture, a key decision is whether to use a dedicated cluster for each application or a single, shared, multi-tenant cluster. [6] This choice has significant implications for cost, operational overhead, and performance.

  • A dedicated cluster provides an isolated environment where all resources are allocated exclusively to one application or team. [6] For example, a retail company might give the Order Service its own cluster to ensure its mission-critical transactions are never affected by a "noisy neighbor" application. [6] This model offers strong performance predictability and security isolation but comes with higher costs and more management overhead. [6]

  • A shared (multi-tenant) cluster uses a single Kafka infrastructure to serve multiple independent applications or teams. [6] A large Retail Stream Hub cluster could be shared by the Order Service, Fulfillment Service, Marketing, and Returns teams. This approach is more cost-effective and allows for faster onboarding of new applications since a cluster is already available. [6] The main challenge is the "noisy neighbor" problem, where one resource-intensive workload can impact others. [6] This requires robust governance, including strict resource quotas and Access Control Lists (ACLs) to ensure fairness and security. [6]

The choice between these two models depends on the organization's needs, budget, and operational maturity. A shared model requires a strong data governance framework to succeed.


| Feature | Dedicated Cluster | Shared Cluster |
| --- | --- | --- |
| Cost | Higher, as resources are provisioned for a single use case [6] | Lower, due to efficient resource utilization across tenants [7] |
| Management Overhead | High, requires dedicated operational effort per cluster [6] | Lower, as a single team manages the core infrastructure [7] |
| Performance Predictability | Strong, with physical or strong logical isolation [6] | Variable, susceptible to the "noisy neighbor" problem [6] |
| Agility | Slower, as provisioning a new cluster can be time-consuming [6] | Faster onboarding for new applications [6] |


Topics: The Stream of Events


At the most basic level, a topic is a named, logical stream of events. [3] You can think of a topic as a category or a feed to which producers publish records and from which consumers subscribe to retrieve them. [9] For our retail platform, we would likely have separate topics for different types of events, such as new_orders, customer_reviews, payment_transactions, and shipping_updates. [11] This structure ensures a clear separation of data streams. [3]
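
To make this concrete, here is a minimal sketch of creating the new_orders topic programmatically with the kafka-python admin client; the broker address, partition count, and replication factor are illustrative assumptions, not recommendations.

Python

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the cluster (assumed local broker address)
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions for parallelism, three replicas for durability (illustrative values)
admin.create_topics([
    NewTopic(name="new_orders", num_partitions=3, replication_factor=3)
])
admin.close()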


Chapter 2: The Data Unit - Messages, Keys, and Headers



The Anatomy of a Kafka Message


Every piece of data in Kafka is called an event or a record. [3] A Kafka message is a data record composed of a key, a value (or payload), and optional headers. [12]

  • Key: The key is an optional sequence of bytes that is crucial for maintaining message order and data grouping. [14]

  • Value (Payload): This is the core of the message, containing the actual business data. [13] Like the key, the value is stored as raw bytes.

  • Headers: Headers are a flexible key-value mechanism for attaching metadata to a message without changing the main payload. [13] This is extremely useful for adding information like a trace_id for distributed tracing, a message_version for schema management, or a source_application identifier. [13] This decouples supplementary information from the business data, making your pipelines more flexible. [13]

Let's look at some retail examples to see how these components work together.


Example 1: OrderPlaced Event


Scenario: A customer places a new order on the website. The Order Service needs to publish this event.

Topic: new_orders

Message Key: order_id (e.g., "ORD-12345")

Message Value (JSON):


JSON



{
    "orderId": "ORD-12345",
    "customerId": "CUST-987",
    "products": [
        {"sku": "P-456", "quantity": 1},
        {"sku": "P-789", "quantity": 2}
    ],
    "totalPrice": 125.50,
    "timestamp": "2024-07-25T10:00:00Z"
}

Headers:




{
    "trace_id": "890c91d8-f9b2-4d57-9e4a-5f0c43c22a3d",
    "source_app": "OrderService"
}
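
For a hands-on view, the following sketch publishes this exact event with the kafka-python client. The broker address and serializer choices are assumptions, and error handling is omitted for brevity.

Python

import json
from kafka import KafkaProducer

# Serialize the key to bytes and the value to JSON bytes (assumed local broker)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "orderId": "ORD-12345",
    "customerId": "CUST-987",
    "products": [{"sku": "P-456", "quantity": 1}, {"sku": "P-789", "quantity": 2}],
    "totalPrice": 125.50,
    "timestamp": "2024-07-25T10:00:00Z",
}

# kafka-python expects headers as a list of (str, bytes) tuples
producer.send(
    "new_orders",
    key="ORD-12345",
    value=event,
    headers=[
        ("trace_id", b"890c91d8-f9b2-4d57-9e4a-5f0c43c22a3d"),
        ("source_app", b"OrderService"),
    ],
)
producer.flush()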


Example 2: ProductShipped Event


Scenario: An item from an order is shipped. The Fulfillment Service publishes an event for a single item.

Topic: products_shipped

Message Key: order_id (e.g., "ORD-12345")

Message Value (JSON):


JSON



{
    "orderId": "ORD-12345",
    "sku": "P-456",
    "shipmentId": "SHIP-XYZ789",
    "shippingCarrier": "FedEx",
    "shippingTimestamp": "2024-07-25T14:30:00Z"
}

Headers:




{
    "trace_id": "890c91d8-f9b2-4d57-9e4a-5f0c43c22a3d",
    "source_app": "FulfillmentService"
}


The Power of the Key: Guiding Events to Partitions


Partitions are the backbone of Kafka's scalability and parallelism. [4] A topic is divided into one or more partitions, and each partition is an ordered, immutable sequence of messages. [4] When a producer sends a message, Kafka uses a partitioning strategy to decide which partition the message should be written to. [16]

  • Key-based Partitioning: When a message has a key, Kafka uses a hash of the key to consistently map it to a specific partition. [14] This is a critical feature because it guarantees that all messages with the same key will always be sent to the same partition and thus be processed in the correct order. [14] For our new_orders topic, using order_id as the key ensures that all events related to a single order (e.g., order creation, payment received, order shipped) are processed sequentially, which is vital for maintaining a consistent state. [8]

  • Round-Robin Partitioning (No Key): If a message is sent without a key, Kafka distributes it evenly across all available partitions in a round-robin fashion. [16] This is ideal for stateless workloads where message order doesn't matter and the goal is to evenly distribute the processing load. [17]

Example of Partitioning in Action:

Imagine our new_orders topic has three partitions: P0, P1, and P2. (A small sketch of this key-to-partition mapping follows the list.)

  • Producer 1 sends an event with key="ORD-12345". Kafka's hashing algorithm assigns this key to P1. All future events from any producer with key="ORD-12345" will also go to P1. [8]

  • Producer 2 sends an event with key="ORD-67890". Kafka assigns this key to P0. All future events with this key will go to P0. [8]

  • Producer 3 sends a series of clickstream events to a user_clicks topic, which does not require ordering. It sends these messages without a key, and Kafka distributes them evenly across all partitions (P0, P1, P2) in a round-robin manner to balance the load. [17]
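
The sketch below illustrates the mapping conceptually. Kafka's Java client actually hashes keys with murmur2, so the md5 stand-in used here will not reproduce a real cluster's assignments; the point is only that a deterministic hash sends equal keys to equal partitions.

Python

import hashlib

# Conceptual stand-in for Kafka's partitioner (the real default uses murmur2)
def partition_for(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

for key in [b"ORD-12345", b"ORD-12345", b"ORD-67890"]:
    # The same key always lands on the same partition
    print(key.decode(), "-> P%d" % partition_for(key, 3))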


Chapter 3: The Participants - Producers & Consumers



Producers: The Senders


Producers are the client applications that create and send event records to Kafka topics. [18] In our retail example, the Order Service application would be a producer, publishing a new OrderPlaced message whenever a customer completes a purchase. A key design principle of Kafka is the complete decoupling of producers and consumers. [19] Producers can publish events at high velocity without needing to know which consumers are listening or how many there are. [19]


Consumers and Consumer Groups: The Receivers


Consumers are the applications that read and process events from Kafka topics. [20] They subscribe to one or more topics to retrieve data. [20]

To enable parallel processing and fault tolerance, consumers work together in a consumer group. [9] A consumer group is a collection of consumers that cooperate to consume data from a topic. [20] The fundamental rule is that each partition can only be consumed by one consumer within a given group. [9] This provides a built-in load balancing mechanism. [9]

A powerful feature is Kafka's automatic rebalancing. If a new consumer joins a group, or an existing consumer fails, Kafka's internal group coordinator automatically reassigns the partitions to the remaining or new consumers. [20] This ensures the workload is redistributed smoothly and the system remains highly available without manual intervention. [20] The number of partitions in a topic determines the maximum number of active consumers in a group; if you have more consumers than partitions, some consumers will be idle. [20]
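
A minimal consumer sketch with kafka-python appears below; starting several copies of this script with the same group_id causes Kafka to divide the topic's partitions among them. The broker address and group name are assumptions.

Python

import json
from kafka import KafkaConsumer

# Every instance started with the same group_id joins the same consumer group
consumer = KafkaConsumer(
    "new_orders",
    bootstrap_servers="localhost:9092",
    group_id="fulfillment-service",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # partition and offset pinpoint exactly where this record came from
    print(f"p{message.partition}@{message.offset}: {message.value['orderId']}")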


Chapter 4: Keeping Score - Volumetrics, Checkpoints, and Consumption



Checkpoints and Offsets: The Progress Tracker


An offset is a sequential ID number assigned to each message within a partition. [9] It is the most critical piece of metadata for a consumer, as it tracks the consumer's position in the log. [23] When a consumer reads a message, it advances its offset, indicating that it has successfully processed that record. [23] This committed offset is the "checkpoint" for a consumer's progress. [23]

The importance of offsets lies in how they enable fault tolerance and recovery. If a consumer instance fails, a new instance in the same group can take over its partitions and seamlessly resume consumption from the last committed offset, preventing message loss or duplication. [20] These committed offsets are durably stored in a special, internal Kafka topic called __consumer_offsets. [21]
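
A common pattern is to disable auto-commit and advance the checkpoint only after a record has been fully processed. A hedged kafka-python sketch:

Python

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "new_orders",
    bootstrap_servers="localhost:9092",
    group_id="fulfillment-service",
    enable_auto_commit=False,  # we decide when the checkpoint advances
)

def process(msg):
    # Placeholder for real business logic
    print(msg.partition, msg.offset, msg.value)

for message in consumer:
    process(message)
    consumer.commit()  # durably records progress in __consumer_offsets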


Volumetrics and Capacity Planning


Kafka is designed to handle high volumes of data with consistent performance, regardless of the total data size. [9] The key factor influencing storage cost is the topic's log retention policy, which defines how long messages are retained before being automatically discarded. [9] For a high-volume retailer, processing 1 million 1 KB messages per minute would require approximately 1.4 TB of storage per day. [24] Understanding this relationship between message size, throughput, and retention is crucial for capacity planning. Monitoring metrics like KafkaLogsDiskUsed allows teams to track storage usage and prevent capacity issues. [25]
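
The arithmetic behind that estimate is worth being able to reproduce in an interview. A back-of-the-envelope check (ignoring replication and compression, which change the real number):

Python

# Back-of-the-envelope storage estimate: 1 million 1 KB messages per minute
msgs_per_min = 1_000_000
msg_bytes = 1_000            # ~1 KB per message
minutes_per_day = 60 * 24

bytes_per_day = msgs_per_min * msg_bytes * minutes_per_day
print(bytes_per_day / 1e12, "TB/day")  # ~1.44 TB/day, before replication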


Chapter 5: The Data Contract - Schema Registry


In a distributed, event-driven architecture, ensuring all applications agree on the structure of data is a major challenge. Without a formal contract, a producer could change a field name, breaking every downstream consumer. [26] This is where a Schema Registry comes in.

A Schema Registry is a centralized service that stores and manages schemas for event data. [23] It acts as a source of truth and a data governance tool, enforcing a contract between producers and consumers. [28] By defining schemas in formats like Avro, JSON, or Protobuf, it ensures consistent message encoding and decoding across the ecosystem. [27]

One of the most powerful features of a Schema Registry is its support for schema evolution. [26] It allows for changes to a schema over time while ensuring compatibility with existing consumers. [27] For example, a customer_returns event schema could be updated to include a new field like return_reason. As long as the change is configured to be "forwards-compatible," older consumers that don't know about the new field can still read the message without breaking. [26]
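
To illustrate what such an evolution looks like, here are two hypothetical Avro schema versions for that event, written as Python dictionaries. Giving the new field a default keeps readers and writers on different versions interoperable.

Python

# Version 1 of a hypothetical customer_returns Avro schema
return_v1 = {
    "type": "record",
    "name": "CustomerReturn",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "sku", "type": "string"},
    ],
}

# Version 2 adds return_reason; the default value lets applications on
# different schema versions continue to read each other's messages
return_v2 = {
    "type": "record",
    "name": "CustomerReturn",
    "fields": [
        {"name": "orderId", "type": "string"},
        {"name": "sku", "type": "string"},
        {"name": "return_reason", "type": "string", "default": "unspecified"},
    ],
}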


Chapter 6: The Safety Net - Failure Handling & Recovery


Distributed systems are prone to failures. The goal of a robust architecture is to handle these failures gracefully, ensuring data integrity and continuity. Kafka provides mechanisms to achieve this, including Exactly-Once Semantics (EOS).

  • Idempotent Producers: The idempotent producer feature, introduced in Kafka 0.11, guarantees that even if a producer retries sending the same message multiple times, Kafka will only write it to the log once. [29] This is achieved by assigning a unique producer ID (PID) and a per-message sequence number, allowing the broker to detect and discard duplicates. [31]

  • Transactions: While idempotent producers prevent duplicates on a single partition, transactions extend this guarantee to a set of messages produced across multiple partitions. [31] This is crucial for "consume-process-produce" workflows, where a downstream application reads from one topic, processes the data, and produces a new message to a different topic. Transactions ensure that either all messages in the transaction are successfully written or none are, providing atomicity. [31] When combined with idempotent producers, this enables the processing and delivery of a message exactly once, even under failure. [31] (A minimal transactional sketch follows this list.)

  • Consumer Failures: Consumer groups provide a robust, self-healing recovery mechanism. [22] If a consumer instance crashes, Kafka's group coordinator will detect the failure and automatically reassign the failed consumer's partitions to other active members in the group. [20] The new consumers will begin processing from the last committed offset, effectively "recovering" from the failure without any data loss or duplication. [23] For unrecoverable messages (e.g., corrupted data), a common best practice is to send them to a dead-letter queue, a separate topic for manual inspection and troubleshooting, rather than halting the entire processing pipeline. [22]
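
The sketch below uses the confluent-kafka Python client to write atomically to two topics. The transactional.id and topic names are assumptions, and a full consume-process-produce loop would also attach the consumer's offsets to the transaction, which is omitted here.

Python

from confluent_kafka import Producer, KafkaException

# transactional.id must stay stable across restarts of this producer instance
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "order-processor-1",  # assumed identifier
    "enable.idempotence": True,
})

producer.init_transactions()
producer.begin_transaction()
try:
    # Both writes become visible together, or not at all
    producer.produce("payment_transactions", key="ORD-12345", value=b"captured")
    producer.produce("order_status_updates", key="ORD-12345", value=b"paid")
    producer.commit_transaction()
except KafkaException:
    producer.abort_transaction()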


Chapter 7: The Full Picture - Event Reconciliation


In a microservices architecture, a single business entity, such as an order, can generate events across multiple decoupled services and topics. The challenge is to maintain a consistent view of the state of that order without a central database, which would introduce tight coupling and a single point of failure. [32]

Event reconciliation is an architectural pattern that solves this problem. It involves joining and aggregating events from multiple streams to derive a single, consolidated view of an entity's state. [32] The Kafka Streams library is an ideal tool for this pattern because it allows for stateful operations using local state stores that are backed by Kafka topics. [32]

Consider a practical example: the Shipping Service needs to know when an entire order is ready to ship, but the order, payment, and fulfillment events are all in separate topics. A Kafka Streams application within the Shipping Service can listen to topics like new_orders and products_manufactured. It uses a local state store to track the status of each order, updating the store as events arrive. [32] Once all products for an order are marked as manufactured, the application produces a single, enriched shipping_ready event to a new topic, triggering the final shipping process. [32] This makes the state of the order a derived result of the event stream, making the system inherently resilient and decoupled. [32]
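
Kafka Streams itself is a Java library, so as a language-neutral illustration of the same idea, this simplified sketch keeps an in-memory dictionary as its "state store" (a real Streams application would use a fault-tolerant, changelog-backed store) and emits shipping_ready once every item of an order has been manufactured:

Python

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "new_orders", "products_manufactured",  # join two streams of events
    bootstrap_servers="localhost:9092",
    group_id="shipping-reconciler",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# In-memory stand-in for a state store: order_id -> SKUs still outstanding
pending = {}

for msg in consumer:
    event = msg.value
    order_id = event["orderId"]
    if msg.topic == "new_orders":
        pending[order_id] = {p["sku"] for p in event["products"]}
    elif msg.topic == "products_manufactured" and order_id in pending:
        pending[order_id].discard(event["sku"])
        if not pending[order_id]:
            # All products manufactured: emit the consolidated event
            producer.send("shipping_ready", value={"orderId": order_id})
            del pending[order_id]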


Chapter 8: The Ecosystem - The Right Tool for the Job



Kafka vs. Flume: The Specialist vs. The Generalist


Apache Flume is a specialized, distributed system designed primarily for collecting large volumes of log data from multiple sources and moving it into a centralized data store, typically HDFS. [33] It operates on a push model, where sources push data to Flume agents. [33] Flume is an excellent choice for simple, point-to-point data pipelines. [33]

Apache Kafka, on the other hand, is a general-purpose, distributed streaming platform that operates on a pull model, where consumers pull data from brokers. [33] It is a central, highly scalable hub for multiple producers and consumers. [33] Unlike Flume, Kafka's consumer groups make it simple to add new consumers without changing the pipeline topology. [35] Use Flume for simple, specialized log collection, and use Kafka as the foundation for a scalable, real-time data pipeline that supports a multitude of applications. [34]


| Dimension | Apache Kafka | Apache Flume |
| --- | --- | --- |
| Primary Use Case | General-purpose stream processing, real-time pipelines [34] | Specialized log collection to a centralized store [33] |
| Data Flow | Pull model (consumers pull data) [33] | Push model (sources push data) [33] |
| Scalability | Horizontally scalable for a large number of consumers and applications [35] | Limited for a large number of consumers [35] |
| Fault Tolerance | Highly fault-tolerant with data replication across brokers [34] | Less resilient; event loss can occur if an agent crashes [33] |


Kafka Streams vs. Apache Flink: The Library vs. The Framework


The choice between Kafka Streams and Apache Flink depends on the specific use case and team expertise.

  • Kafka Streams: This is a lightweight, embeddable Java library that runs directly inside a standard application. [36] It does not require a separate cluster and uses Kafka's consumer group protocol for fault tolerance and parallelism. [38] This model is ideal for building lightweight, reactive microservices and real-time applications where the stream processing logic is tightly integrated with the application. [36] It has lower operational overhead and can be managed by a standard application development team. [36]

  • Apache Flink: Flink is a full-featured distributed stream processing framework that runs on its own cluster, managed by a resource manager like YARN or Kubernetes. [38] It is designed for heavy-duty, large-scale stream analytics and complex event-time processing. [38] Flink offers a rich set of features, including support for both bounded and unbounded streams and a powerful SQL API. [36] Flink is often managed by a dedicated infrastructure team and is a better fit for complex, data-centric workloads that require a dedicated, high-performance cluster. [36]


| Dimension | Kafka Streams | Apache Flink |
| --- | --- | --- |
| Deployment Model | Embedded library within a Java application [36] | Standalone cluster managed by a resource manager [38] |
| Ideal Use Case | Microservices, real-time applications [36] | Large-scale stream analytics, complex aggregations [39] |
| Operational Complexity | Lower, managed with existing application tools [36] | Higher, requires a dedicated infrastructure team [36] |
| Data Sources | Primarily Kafka topics [36] | Multiple sources including Kafka, files, databases [36] |


Chapter 9: Building the Pipeline - Hands-On Code Examples


Integrating Kafka with a stream processing framework like Apache Spark is a common pattern for building robust data pipelines. Spark Structured Streaming provides a fault-tolerant, scalable, and easy-to-use API for processing data from Kafka topics. [41]


Reading from Kafka with PySpark


This PySpark example demonstrates how to set up a streaming read from a Kafka topic. It configures the Kafka brokers and specifies startingOffsets as latest, which ensures the stream begins processing new messages as they arrive. [42]


Python



import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, ArrayType, IntegerType

# Initialize Spark Session
spark = SparkSession.builder \
  .appName("KafkaStructuredStreamingReader") \
  .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Define the schema for our incoming order data
schema = StructType([
    StructField("orderId", StringType(), True),
    StructField("customerId", StringType(), True),
    StructField("products", ArrayType(StructType([
        StructField("sku", StringType(), True),
        StructField("quantity", IntegerType(), True)
    ])), True),
    StructField("totalPrice", DoubleType(), True),
    StructField("timestamp", StringType(), True)
])

# Read data from Kafka topic in a streaming fashion
# Note: In a production environment, 'host1:port1,host2:port2' would be your Kafka broker list
df_kafka_stream = spark.readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "new_orders") \
  .option("startingOffsets", "latest") \
  .load()

# Deserialize the value column from binary to a string and then parse the JSON
df_parsed = df_kafka_stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
  .withColumn("value", F.from_json(F.col("value"), schema)) \
  .select(F.col("key"), F.col("value.*"))

# Start the streaming query and write the output to the console
# In a real-world scenario, you would write this to a persistent sink (e.g., a data warehouse, another Kafka topic)
query = df_parsed.writeStream \
  .outputMode("append") \
  .format("console") \
  .option("truncate", "false") \
  .start()

query.awaitTermination()


Writing to Kafka with PySpark


This example shows how to write a streaming DataFrame back to another Kafka topic, for instance, after processing it. This is a common pattern in stream processing pipelines where data is transformed and then routed to a new destination. [42]


Python



from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit

# Initialize Spark Session
spark = SparkSession.builder \
  .appName("KafkaStructuredStreamingWriter") \
  .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0") \
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Simulate a stream of processed data with Spark's built-in rate source
# In a real application, this would be the output of an upstream streaming query
df_processed = spark.readStream \
  .format("rate") \
  .option("rowsPerSecond", 1) \
  .load() \
  .withColumn("key", concat(lit("ORD-"), col("value").cast("string"))) \
  .withColumn("value", concat(lit("Status update for order ORD-"), col("value").cast("string"))) \
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write the streaming DataFrame back to a Kafka topic
# Kafka requires the output columns to be named 'key' and 'value'
query = df_processed.writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("topic", "order_status_updates") \
  .option("checkpointLocation", "/tmp/spark/checkpoints/order_updates") \
  .start()

query.awaitTermination()


Reading from Kafka with Scala


The Scala code below shows the equivalent process for reading from a Kafka topic. It follows the same logic, configuring the Spark session to connect to Kafka and read from the new_orders topic. [41]


Scala



import org.apache.spark.sql.{SparkSession, Dataset, Row}
import org.apache.spark.sql.functions.{col, from_json, to_json, struct}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}
import org.apache.spark.sql.types._

object KafkaNewOrdersReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
    .appName("KafkaNewOrdersReader")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Define the schema for the incoming JSON data
    val schema = StructType(Seq(
      StructField("orderId", StringType, true),
      StructField("customerId", StringType, true),
      StructField("products", ArrayType(StructType(Seq(
        StructField("sku", StringType, true),
        StructField("quantity", IntegerType, true)
      ))), true),
      StructField("totalPrice", DoubleType, true),
      StructField("timestamp", StringType, true)
    ))

    // Read from the Kafka topic using Structured Streaming
    val df_kafka_stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "new_orders")
    .option("startingOffsets", "latest")
    .load()

    // Cast the binary key and value to String and parse the JSON
    val df_parsed = df_kafka_stream
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .withColumn("value", from_json(col("value"), schema))
    .select(col("key"), col("value.*"))

    // Write the output to the console for demonstration
    val query: StreamingQuery = df_parsed.writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .start()

    query.awaitTermination()
  }
}


Writing to Kafka with Scala


This Scala example demonstrates writing a streaming DataFrame to a new Kafka topic, for example, a topic that aggregates and enriches order information for a fulfillment team. [41]


Scala



import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.types._

object KafkaStreamWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
    .appName("KafkaStreamWriter")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Simulate a stream of processed data with Spark's built-in rate source
    // In a real application, this would be a streaming DataFrame from an upstream query
    val streamingDf = spark.readStream
    .format("rate")
    .option("rowsPerSecond", 1)
    .load()
    .withColumn("key", concat(lit("order-"), col("value").cast("string")))
    .withColumn("value_payload", concat(lit("Status for order-"), col("value").cast("string")))
    .selectExpr("CAST(key AS STRING)", "CAST(value_payload AS STRING) AS value")

    // Write the stream to a Kafka topic
    val query = streamingDf.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "order_status_updates")
    .option("checkpointLocation", "/tmp/spark/checkpoints/order_updates") // Crucial for fault-tolerance
    .start()

    query.awaitTermination()
  }
}

The use of checkpointLocation is crucial for fault tolerance in Spark Structured Streaming. It stores the state of the query and the offsets of the last processed Kafka messages, allowing the stream to resume from the exact point of failure without reprocessing data. [43]

Works cited

  1. Kafka Architecture - GeeksforGeeks, accessed September 7, 2025, https://www.geeksforgeeks.org/apache-kafka/kafka-architecture/

  2. Kafka Logging Guide: The Basics - CrowdStrike, accessed September 7, 2025, https://www.crowdstrike.com/en-us/guides/kafka-logging/

  3. Intro to Apache Kafka®: Tutorials, Explainer Videos & More - Confluent Developer, accessed September 7, 2025, https://developer.confluent.io/what-is-apache-kafka/

  4. Kafka Partitions: Essential Concepts for Scalability and Performance - DataCamp, accessed September 7, 2025, https://www.datacamp.com/tutorial/kafka-partitions

  5. Deploying and scaling Apache Kafka on Amazon EKS | Containers, accessed September 7, 2025, https://aws.amazon.com/blogs/containers/deploying-and-scaling-apache-kafka-on-amazon-eks/

  6. Dedicated Kafka Cluster vs. Shared Kafka Cluster - AutoMQ, accessed September 7, 2025, https://www.automq.com/blog/dedicated-kafka-cluster-vs-shared-kafka-cluster

  7. APACHE KAFKA: Multi-Tenancy Overview - Orchestra, accessed September 7, 2025, https://www.getorchestra.io/guides/apache-kafka-multi-tenancy-overview

  8. Kafka Message Key: A Comprehensive Guide - Confluent, accessed September 7, 2025, https://www.confluent.io/learn/kafka-message-key/

  9. log.retention.hours - Apache Kafka, accessed September 7, 2025, https://kafka.apache.org/08/documentation.html

  10. Apache Kafka for Beginners: A Comprehensive Guide - DataCamp, accessed September 7, 2025, https://www.datacamp.com/tutorial/apache-kafka-for-beginners-a-comprehensive-guide

  11. Apache Kafka Use Cases: When To Use It? When Not To? | Upsolver, accessed September 7, 2025, https://www.upsolver.com/blog/apache-kafka-use-cases-when-to-use-not

  12. What is Kafka? - Apache Kafka Explained - AWS - Updated 2025, accessed September 7, 2025, https://aws.amazon.com/what-is/apache-kafka/

  13. Kafka Headers: Concept & Best Practices & Examples - AutoMQ, accessed September 7, 2025, https://www.automq.com/blog/kafka-headers-concept-best-practices-examples

  14. Apache Kafka Partition Key: A Comprehensive Guide - Confluent, accessed September 7, 2025, https://www.confluent.io/learn/kafka-partition-key/

  15. Using custom Kafka headers for advanced message processing - Tinybird, accessed September 7, 2025, https://www.tinybird.co/blog-posts/using-custom-kafka-headers

  16. Apache Kafka Partition Strategy: Optimizing Data Streaming at Scale - Confluent, accessed September 7, 2025, https://www.confluent.io/learn/kafka-partition-strategy/

  17. Kafka topic partitioning strategies and best practices - New Relic, accessed September 7, 2025, https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning

  18. www.statsig.com, accessed September 7, 2025, https://www.statsig.com/perspectives/kafka-consumers-producers#:~:text=Introduction%20to%20Kafka%20producers%20and%20consumers&text=Producers%20are%20the%20applications%20or,read%20data%20from%20Kafka%20topics.

  19. Documentation - Apache Kafka, accessed September 7, 2025, https://kafka.apache.org/documentation/

  20. A Beginner's Guide to Kafka® Consumers - Instaclustr, accessed September 7, 2025, https://www.instaclustr.com/blog/a-beginners-guide-to-kafka-consumers/

  21. Kafka Consumer for Confluent Platform, accessed September 7, 2025, https://docs.confluent.io/platform/current/clients/consumer.html

  22. Managing Kafka Consumer Errors Strategies for Retry and Recovery - Scaler Topics, accessed September 7, 2025, https://www.scaler.com/topics/kafka-tutorial/kafka-consumer-error-handling-retry-and-recovery/

  23. Kafka Consumer Offsets Guide—Basic Principles, Insights ..., accessed September 7, 2025, https://www.confluent.io/blog/guide-to-consumer-offsets/

  24. Kafka throughput—Trade-offs, solutions and alternatives - Redpanda, accessed September 7, 2025, https://www.redpanda.com/guides/kafka-alternatives-kafka-throughput

  25. Storage calculation for kafka | AWS re:Post, accessed September 7, 2025, https://repost.aws/questions/QUFfRTaccwT9KmXtEbOfh2MQ/storage-calculation-for-kafka

  26. Schema registry overview | Google Cloud Managed Service for Apache Kafka, accessed September 7, 2025, https://cloud.google.com/managed-service-for-apache-kafka/docs/schema-registry/schema-registry-overview

  27. Comprehensive Guide to Kafka Schema Registry - RisingWave, accessed September 7, 2025, https://risingwave.com/blog/comprehensive-guide-to-kafka-schema-registry/

  28. Schema Registry For Data Governance - Meegle, accessed September 7, 2025, https://www.meegle.com/en_us/topics/schema-registry/schema-registry-for-data-governance

  29. Exactly-once Semantics is Possible: Here's How Apache Kafka Does it, accessed September 7, 2025, https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

  30. Kafka Exactly Once Semantics Implementation: Idempotence and Transactional Messages, accessed September 7, 2025, https://medium.com/@AutoMQ/kafka-exactly-once-semantics-implementation-idempotence-and-transactional-messages-3c2168603d2b

  31. What is Kafka Exactly Once Semantics - GitHub, accessed September 7, 2025, https://github.com/AutoMQ/automq/wiki/What-is-Kafka-Exactly-Once-Semantics

  32. Reconcile and aggregate events using Kafka streams | Devoteam, accessed September 7, 2025, https://www.devoteam.com/expert-view/reconcile-and-aggregate-events-using-kafka-streams/

  33. Difference between Apache Kafka and Flume - Tutorialspoint, accessed September 7, 2025, https://www.tutorialspoint.com/difference-between-apache-kafka-and-flume

  34. Apache Kafka vs Flume | Top 5 Awesome Comparison To Know - EDUCBA, accessed September 7, 2025, https://www.educba.com/apache-kafka-vs-flume/

  35. How can one know when to use Apache flume and when to use Apache Kafka? - Quora, accessed September 7, 2025, https://www.quora.com/How-can-one-know-when-to-use-Apache-flume-and-when-to-use-Apache-Kafka

  36. Kafka Streams vs. Apache Flink - OpenLogic, accessed September 7, 2025, https://www.openlogic.com/blog/apache-flink-vs-kafka-streams

  37. Architecture - Apache Kafka, accessed September 7, 2025, https://kafka.apache.org/11/documentation/streams/architecture

  38. Flink vs Kafka Streams: A Complete Comparison - Confluent, accessed September 7, 2025, https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/

  39. Flink vs. Kafka: A Quick Guide to Stream Processing Engines | by AnalytixLabs | Medium, accessed September 7, 2025, https://medium.com/@byanalytixlabs/flink-vs-kafka-a-quick-guide-to-stream-processing-engines-b09dd0e6b8af

  40. First-Time Kafka-Flink Integration: Stream Processing Insights | by Mitchell Gray | Medium, accessed September 7, 2025, https://medium.com/@mitch_datorios/first-time-kafka-flink-integration-stream-processing-insights-b55e0a4858dd

  41. Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher), accessed September 7, 2025, https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

  42. Stream processing with Apache Kafka and Databricks | Databricks ..., accessed September 7, 2025, https://docs.databricks.com/aws/en/connect/streaming/kafka

  43. Structured Streaming checkpoints - Azure Databricks | Microsoft Learn, accessed September 7, 2025, https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/checkpoints

  1. Introduction: The Big Picture – From Snail Mail to Speedy Delivery The journey of a data packet across the internet can be a surprisingly long and arduous one. Imagine an online service with its main servers, or "origin servers," located in a single, remote data center, perhaps somewhere in a quiet town in North America. When a user in Europe or Asia wants to access a file—say, a small image on a website—that file has to travel a long physical distance. The long journey, fraught with potential delays and network congestion, is known as latency. This can result in a frustrating user experience, a high bounce rate, and an overwhelmed origin server struggling to handle traffic from around the globe. This is where a Content Delivery Network (CDN) comes into play. A CDN is a sophisticated system of geographically distributed servers that acts as a middle layer between the origin server and the end-user. 1 Its primary purpose is to deliver web content by bringing it closer to...