Zoom Data Engineering Part 6

1 minute read

What is Kafka?

Apache Kafka is used to handle real-time data feeds (ingesting, moving, and consuming them). It is part of the Apache Software Foundation and is a well-known open-source stream processing platform that aims to provide a high-throughput, low-latency, fault-tolerant platform capable of handling real-time data input.

Fun Fact: Kafka was developed at LinkedIn (my least favorite website) and named after the author Franz Kafka because it is “a system optimized for writing”, and its creator liked Kafka’s work.[6]

In a data project we can differentiate between consumers and producers:

Consumers are those that consume the data: web pages, microservices, apps, etc.

Producers are those who supply the data to consumers.

Connecting consumers to producers directly can lead to a tangled, hard-to-maintain architecture in complex projects. Kafka solves this issue by becoming an intermediary that all other components connect to.

Kafka works by allowing producers to publish messages to topics, from which consumers read them in real time.
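The intermediary pattern above can be sketched with a toy in-memory "broker" (purely illustrative names and classes, not the real Kafka API — real producers and consumers talk to a Kafka cluster over the network):

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for Kafka: decouples producers from consumers.

    Producers and consumers only know the broker and a topic name;
    they never reference each other directly.
    """

    def __init__(self):
        # topic name -> list of messages (a crude append-only "log")
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        # A producer pushes a message to a topic on the broker
        self.topics[topic].append(message)

    def consume(self, topic, offset=0):
        # A consumer reads from the topic starting at some offset
        return self.topics[topic][offset:]


broker = InMemoryBroker()
broker.produce("rides", {"ride_id": 1, "status": "started"})
broker.produce("rides", {"ride_id": 1, "status": "finished"})
print(broker.consume("rides"))
```

Because every component connects only to the broker, adding a new consumer (or producer) never requires touching the others — that is the maintainability win Kafka provides at scale.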

Basic Kafka Components

The sections below cover the basic keywords used when working with Kafka.

What is a Message in Kafka?

Messages are the basic communication abstraction used by producers and consumers to share information. A message has three main components:

  • Key: Used to identify the message and for additional Kafka features such as partition assignment
  • Value: The actual information that producers push and consumers are interested in
  • Timestamp: Used for logging

Kafka stores key-value messages that come from arbitrarily many processes called producers.
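The three components above can be sketched as a simple Python structure (hypothetical class and field names for illustration; the real wire format is Kafka's binary protocol):

```python
import time
from dataclasses import dataclass, field


@dataclass
class Message:
    # Key: identifies the message; Kafka also hashes the key
    # to decide which partition the message lands on
    key: str
    # Value: the payload producers push and consumers care about
    value: dict
    # Timestamp: attached for logging (epoch milliseconds here)
    timestamp_ms: int = field(
        default_factory=lambda: int(time.time() * 1000)
    )


msg = Message(key="ride-123", value={"status": "started"})
print(msg)
```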

Broker and Cluster

A Kafka Broker is a machine (physical or virtualized) on which Kafka is running.

A Kafka Cluster is a collection of brokers (nodes) working together.

Configuring Kafka

Install on macOS:

brew install kafka
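After installing, the broker can be started locally — a sketch assuming a default Homebrew setup, where Kafka's ZooKeeper dependency is also installed by the formula:

```shell
# Start ZooKeeper first, then Kafka, as background services
brew services start zookeeper
brew services start kafka
```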

More on this later…

Kafka fills a role similar to RabbitMQ, another widely used message broker.