Zoom Data Engineering Part 6
What is Kafka
Apache Kafka is used to handle real-time data feeds (ingesting, moving, and consuming them). It is part of the Apache Software Foundation and is a well-known open-source stream processing platform that aims to be high-throughput, low-latency, and fault-tolerant while handling real-time data input.
Fun Fact: Kafka was developed at LinkedIn (my least favorite website) and named after the author Franz Kafka because it is “a system optimized for writing”, and its creator liked Kafka’s work.[6]
In a data project we can differentiate between consumers and producers:
Consumers are those that consume the data: web pages, micro services, apps, etc.
Producers are those who supply the data to consumers.
Connecting consumers to producers directly can lead to a tangled, hard-to-maintain architecture in complex projects. Kafka solves this issue by becoming an intermediary that all other components connect to.
Kafka works by allowing producers to send messages which are then pushed in real time by Kafka to consumers.
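As a toy illustration of this intermediary role, here is a minimal sketch in plain Python (no real Kafka involved; `ToyBroker` and all names are made up for illustration). Producers only ever talk to the broker, and each consumer reads from the topic's log at its own pace:

```python
from collections import defaultdict

class ToyBroker:
    """Toy stand-in for a Kafka broker: topics are append-only message logs."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> list of messages
        self.offsets = defaultdict(int)   # (consumer, topic) -> next offset to read

    def produce(self, topic, message):
        # Producers append to the topic; they never contact consumers directly.
        self.topics[topic].append(message)

    def consume(self, consumer_id, topic):
        # Each consumer tracks its own position (offset) in the topic's log.
        offset = self.offsets[(consumer_id, topic)]
        new_messages = self.topics[topic][offset:]
        self.offsets[(consumer_id, topic)] = len(self.topics[topic])
        return new_messages

broker = ToyBroker()
broker.produce("rides", {"ride_id": 1, "distance_km": 3.2})
broker.produce("rides", {"ride_id": 2, "distance_km": 7.5})

print(broker.consume("web_app", "rides"))  # both messages
print(broker.consume("web_app", "rides"))  # [] - this consumer is caught up
```

Note that a second consumer with a different `consumer_id` would still see both messages, because the broker keeps the log around instead of deleting messages once read, which is one of the key ideas behind Kafka's design.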
Basic Kafka Components
- Keywords for Kafka
What is a Message in Kafka
Messages are the basic communication abstraction used by producers and consumers to share information. A message has 3 main components:
- Key: Used to identify the message and for additional Kafka features such as partition assignment
- Value: The actual information that producers push and that consumers are interested in
- Timestamp: Used for logging
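A rough sketch of these three components, assuming illustrative field names (this is not Kafka's actual wire format). It also shows why the key matters for partitioning: hashing the key means all messages with the same key land on the same partition, preserving per-key ordering:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Message:
    """Illustrative message with the three components described above."""
    key: str     # identifies the message; also drives partition assignment
    value: dict  # the payload that consumers are actually interested in
    timestamp: float = field(default_factory=time.time)  # used for logging

def assign_partition(msg: Message, num_partitions: int) -> int:
    # Same key -> same hash -> same partition, so per-key order is kept.
    return hash(msg.key) % num_partitions

m1 = Message(key="rider_42", value={"distance_km": 3.2})
m2 = Message(key="rider_42", value={"distance_km": 8.1})
print(assign_partition(m1, 6) == assign_partition(m2, 6))  # True
```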
Kafka stores key-value messages that come from arbitrarily many processes called producers.
Broker and Cluster
A Kafka Broker is a machine (physical or virtualized) on which Kafka is running.
A Kafka Cluster is a collection of brokers (nodes) working together.
Configuring Kafka
Install on MacOS:
brew install kafka
In this respect it plays a similar role to other message brokers such as RabbitMQ.
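After installing, the services can be started in the background with Homebrew. A minimal setup sketch (the Homebrew Kafka formula has traditionally depended on ZooKeeper, though newer Kafka versions can run without it, so the exact steps may vary by version; `test-topic` is just an example name):

```shell
# Start ZooKeeper and Kafka as background services
brew services start zookeeper
brew services start kafka

# Smoke test: create a topic and list topics on the local broker (default port 9092)
kafka-topics --bootstrap-server localhost:9092 --create --topic test-topic
kafka-topics --bootstrap-server localhost:9092 --list
```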