Pharmacy ETL: from Java collector to Redshift and Power BI
TL;DR - A pipeline from pharmacies → S3 Parquet → Python/Iceberg → Redshift, consumed by a Flutter app, a Java API and Power BI. The real challenges: volume, consistency across layers, and operations (monitoring and reprocessing).
I designed and ran a data pipeline that starts at the pharmacies, goes through a data lake on S3 with Apache Iceberg and lands in Redshift. The Java collector reached over 5,000 stores (the target was 16,000) over about a year. The overall flow is in the diagram below.
```mermaid
flowchart TB
Farmacia[Pharmacies] --> Coletor[Java Collector]
Coletor --> Gateway[Java Gateway]
Gateway --> S3[S3 Parquet/bronze]
S3 --> Python[Python Pipeline]
Python --> Iceberg[Iceberg silver/gold]
Iceberg --> Redshift[Redshift]
Redshift --> App[Flutter App iOS/Android]
Redshift --> API[Java API]
Redshift --> PBI[Power BI]
```
From store to S3 (bronze)
In the stores, a Java collector runs and sends only the tables that matter: sales, inventory and the like. The destination is a gateway, also in Java, that validates the basics and writes Parquet to S3, partitioned by date and source.
In practice this produced about 100 thousand Parquet files per cycle (up to a thousand rows per file) and around 6 GB/hour ingested and processed. Partitioning by date and source was essential to avoid killing the pipeline on reads and to reprocess only what was needed.
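The partition layout is simple but worth pinning down early, since every downstream read and reprocess depends on it. A minimal stdlib-only sketch of a Hive-style date/source key scheme (the `bronze_key` function, the prefix and the store id format are illustrative, not the production code):

```python
from datetime import date

def bronze_key(source_id: str, event_date: date, part: int) -> str:
    """Build a Hive-style partitioned S3 key for a bronze Parquet file.

    Partitioning by date and source lets downstream jobs read (or
    reprocess) a single day/store slice without scanning the bucket.
    """
    return (
        f"bronze/event_date={event_date.isoformat()}"
        f"/source={source_id}/part-{part:05d}.parquet"
    )

# Example: the key for a hypothetical store "store_0042" on 2024-05-01, file #3
key = bronze_key("store_0042", date(2024, 5, 1), 3)
# → "bronze/event_date=2024-05-01/source=store_0042/part-00003.parquet"
```

With this layout, reprocessing one day of one store is a prefix listing, not a bucket scan.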
Bronze → Iceberg → Redshift (silver and gold)
A Python pipeline reads those Parquet files, normalizes and loads into Iceberg. I split into silver (clean, conformant data) and gold (aggregates and views ready for consumption). From Iceberg I copy into Redshift for analytical queries.
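The heart of the bronze → silver step is conforming records to one canonical schema. A stdlib-only sketch of that normalization, assuming a sales table; the column names, the legacy renames and `to_silver` itself are made up for illustration:

```python
from datetime import datetime

# Illustrative silver schema for a sales record: canonical name -> caster.
SILVER_SALES_SCHEMA = {
    "store_id": str,
    "sku": str,
    "quantity": int,
    "unit_price_cents": int,
    "sold_at": datetime.fromisoformat,
}

# Collectors in the field may still ship older column names; map them here.
RENAMES = {"loja": "store_id", "produto": "sku", "qtd": "quantity"}

def to_silver(raw: dict) -> dict:
    """Conform one bronze record to the silver schema.

    Renames legacy columns, casts types, and drops anything not in the
    schema, so gold aggregates never see stray fields. A missing
    required column raises, which surfaces bad batches early.
    """
    renamed = {RENAMES.get(k, k): v for k, v in raw.items()}
    return {col: cast(renamed[col]) for col, cast in SILVER_SALES_SCHEMA.items()}

row = to_silver({"loja": 17, "produto": "A-10", "qtd": "2",
                 "unit_price_cents": 1599, "sold_at": "2024-05-01T10:30:00"})
```

Gold then aggregates over these conformed rows, which is what keeps the Redshift side predictable.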
The consumers at the end of the line: a Flutter app (iOS and Android) with near real-time data (inventory, sales, rep performance, margin, regional comparison), a Java API for partners, and Power BI for analysts' reports.
Where it hurt (and what helped)
Volume. Many pharmacies, many events. Proper partitioning in S3 and Iceberg reduced read cost and allowed reprocessing by partition without rescanning everything.
Consistency across layers. Keeping bronze, silver and gold aligned without blocking the pipeline requires clear rules for schema, naming and (when possible) contracts. When something breaks, the issue often shows up in the consumption layer; well-defined partitions help isolate and fix.
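A contract between layers can be as small as "these fields, these types, nothing else", checked before data moves on. A sketch of such a check, stdlib-only and with hypothetical field names; returning violations instead of raising lets a bad batch be quarantined and alerted on without blocking the run:

```python
def contract_violations(record: dict, contract: dict) -> list:
    """Check one record against a layer's schema contract.

    Returns human-readable violations for missing fields, wrong types
    and unexpected extras, instead of raising on the first problem.
    """
    issues = []
    for field, expected in contract.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            issues.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record.keys() - contract.keys():
        issues.append(f"unexpected field: {field}")
    return issues

CONTRACT = {"store_id": str, "sku": str, "quantity": int}
ok = contract_violations({"store_id": "17", "sku": "A-10", "quantity": 2}, CONTRACT)
bad = contract_violations({"store_id": 17, "sku": "A-10"}, CONTRACT)
# ok == []; bad reports a wrong type and a missing field
```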
Governance. Who can see what in Redshift and Power BI, and how to document sources, became a constant pain. Defining ownership per set of tables and keeping a minimal glossary (source, refresh, intended use) saves time later.
Operations. The biggest bottleneck was monitoring failures in the collector or gateway and reprocessing when needed. Investing in alerts by partition/window and idempotent reprocessing jobs pays off quickly.
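The key property of a reprocessing job is that running it twice leaves the same result: overwrite a partition, never append to it. A toy sketch of that pattern, with a plain dict standing in for the silver table (in the real pipeline this would be a partition overwrite in Iceberg, not an in-memory dict):

```python
def reprocess_partition(storage: dict, partition: str, transform, read_bronze):
    """Idempotently rebuild one silver partition from bronze.

    Dropping the partition's previous output before writing makes the
    job safe to rerun after a failure or a late-arriving fix.
    """
    rows = [transform(r) for r in read_bronze(partition)]
    storage.pop(partition, None)   # discard any earlier attempt
    storage[partition] = rows      # overwrite-by-partition, not append
    return len(rows)

bronze = {"2024-05-01": [{"qtd": 1}, {"qtd": 2}]}
silver = {}

n = reprocess_partition(
    silver, "2024-05-01",
    transform=lambda r: {"quantity": r["qtd"]},
    read_bronze=lambda p: bronze[p],
)
# Running it again changes nothing: same partition, same rows.
reprocess_partition(silver, "2024-05-01",
                    transform=lambda r: {"quantity": r["qtd"]},
                    read_bronze=lambda p: bronze[p])
```

With this shape, an alert on a bad partition/window maps directly to one rerun of one job.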
Summary
- Bronze on S3 in Parquet: partition by date and source; small files (up to ~1k rows) make granular reprocessing easier.
- Silver/gold in Iceberg: separating “clean data” from “aggregate ready for consumption” makes Redshift and Power BI simpler and more predictable.
- Redshift as consumption layer: app, API and BI read from the same source; the heavy lifting stays in the data lake.
- Governance and operations: defining who owns what and how to reprocess (and monitor) avoids surprises when volume grows.
If you build something similar, set aside time from the start for partitioning, idempotent reprocessing and basic governance. The rest scales better once that’s in place.