07. KAFKA STREAMING PIPELINE

Live comments in, trending entities out.

Reddit comments flow through Kafka into Spark Structured Streaming, which extracts named entities with NLTK and counts them. Logstash then indexes the snapshots into Elasticsearch, and Kibana sits on top for the dashboards. The whole stack is containerized with Docker Compose, so one command brings the broker, the stream job, and the search layer up together.

Architecture

The flow is a straight line. Live comments enter Kafka, Spark reads and transforms the stream into entity counts, Logstash ships those snapshots into Elasticsearch, and Kibana renders them.

pipeline architecture, reddit through kafka and spark to elasticsearch and kibana
reddit into kafka into spark into logstash into elasticsearch into kibana.

The original one hour run

Before the on demand version, the pipeline ran continuously for an hour against live Reddit. These Kibana snapshots at fifteen, thirty, forty five, and sixty minutes show the entity counts climbing as more comments streamed in.

kibana dashboard fifteen minutes into the run
fifteen minutes in.
kibana dashboard thirty minutes into the run
thirty minutes in.
kibana dashboard forty five minutes into the run
forty five minutes in.
kibana dashboard sixty minutes into the run
sixty minutes in.

Run it yourself

Clicking run boots the entire stack on a fresh GitHub Actions runner. It streams live comments for about a minute, extracts and counts the entities, publishes the snapshot back to this page, and then everything shuts down. Nothing stays running between visits, so the results below always come from the most recent run someone triggered.

run the pipeline

a click boots the real stack on github, streams for about a minute, then shuts it all down.

latest run

loading the latest run.

The stack

PieceWhat it does
Kafkathe message broker that buffers the incoming comment stream
Spark Structured Streamingreads the stream, extracts named entities with nltk, and counts them
Logstashships the entity snapshots into elasticsearch
Elasticsearchstores and aggregates the entity counts
Kibanathe dashboard that visualizes the trending entities
Docker Composebrings the whole stack up with a single command
prawthe reddit api client that pulls the live comments