designing instagram's hashtag page and the "newly unread" indicator
two systems that look deceptively simple on the surface but hide a bunch of interesting engineering decisions underneath. one is the hashtag page you see on instagram — you know, when you tap #sunset and see the name, total posts, and a grid of top photos. the other is that tiny number on your messaging icon that tells you how many new people messaged you. let's walk through both.
part 1: the hashtag service
the requirement
imagine you're an early engineer at instagram and someone walks up to you and says — "hey, we need to build this page." for every hashtag, all you have to show is:
- the name of the hashtag
- total number of posts (approximate is fine)
- top 100 photos tagged with it
the top 100 photos? that's computed by a data science team. you don't care about their logic — exponential decay, reaction counts in the last hour, whatever. they hand you a list of 100 post IDs. your job is to render the page. sounds simple right? well, the moment you dig in, you realize there's so much more.
design principle: best user experience
before we start, here's a discipline that most engineers skip — define your guiding principle. for this system, it's best user experience. every single design decision should optimize for UX. whenever you're stuck between option A and option B, you pick the one that makes the user's experience better.
this is honestly one of the best habits you can build. without a guiding principle, you end up in the classic trap where someone says "i want strong consistency, high availability, AND fault tolerance" and you're basically violating the CAP theorem. you can't have it all. but what you're okay giving up on should depend on what you're optimizing for.
the read path
so the user taps #sunset. a GET /tags/{tag_name} request hits your hashtag API servers. and because we're optimizing for UX, this one API call should return everything — the name, the count, and the top 100 photos. no fan-out from the frontend making three separate calls. one request, one response, done.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Client │ │ Hashtag API │ │ Hashtag DB │
│ │────>│ │────>│ Partitioned KV │
│ GET /tags/sun │<────│ Returns full │<────│ (Mongo/Dynamo) │
│ │ │ JSON response │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
the hashtag db is a partitioned key-value store — could be mongodb, dynamodb, whatever. but why partitioned? because our access pattern is dead simple — given a hashtag name, fetch its document. that's a key-value lookup. the hashtag name (like sunset) is the key, the document with count and top photos is the value. no range queries, no joins, no fancy relational stuff. just get(key) → value.
and because there are millions of hashtags, one machine can't hold all of them. you need to spread the data across multiple nodes. the hashtag name is a natural partition key — it distributes well (hashtags are diverse enough), and all the data you need for one hashtag lives in a single document on a single node. no cross-partition queries needed. so the three requirements from storage are: partitioned, key-value access, and durable. mongodb, dynamodb, couchbase — any of these work. pick based on team expertise and operational comfort.
the document looks something like this:
{
"tag": "sunset",
"total_posts": 1200000,
"top_100": [ ... ]
}
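given a store with a get(key) → document interface, the whole read path is a few lines. a sketch — the function name and error shape here are illustrative, not a real instagram API:

```python
def get_tag_page(db, tag_name):
    """one request, one response: name, count, and top photos together."""
    doc = db.get(tag_name)  # single key-value lookup, hits exactly one partition
    if doc is None:
        return {"error": "tag not found"}, 404
    return doc, 200
```

no fan-out, no joins — the partition key (the tag name) takes you straight to the one document that has everything.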
now the interesting question — what goes inside top_100?
storing top 100: the storage tradeoff
you have three options here and this is where things get spicy. let's do the math for each so we make an informed decision instead of going with gut feel.
option 1: store only post IDs. so top_100 is just [p1, p2, p3, ..., p100]. let's break down the document size:
tag name: ~12 bytes (avg hashtag length)
total_posts: ~32 bytes (number serialized as string + key overhead)
top_100 array: 100 × 8 bytes (snowflake IDs are 64-bit = 8 bytes each)
= 800 bytes
─────────────────────────────────────
total: ~850 bytes ≈ 1 KB
cheap on storage. but now when a request comes in, you read this document, get the 100 IDs, and then you have to do a batch read from the posts database to fetch the actual post details — image URLs, captions, whatever you need to render the grid. that's a second lookup. more latency. bad for UX.
option 2: store entire post metadata. so you denormalize everything — caption, image URL, user details, tags, the works. per post:
post ID: 8 bytes (snowflake)
caption: ~160 bytes (avg caption with hashtags baked in,
think twitter-length text)
image URL: ~160 bytes (CDN URLs can be long)
user metadata: ~160 bytes (username, profile pic URL, etc.)
─────────────────────────────────────
per post: ~488 bytes, but captions contain hashtags
which eat space, so realistically ~560-660 bytes
let's call it ~1 KB conservatively
top_100 array: 100 × ~1 KB = ~100 KB per document
that's ~100 KB read from disk on every lookup. and here's the nastier problem — if you store likes count here, every time a like happens, you have to update it in the posts db AND here. you just introduced a consistency nightmare and extra plumbing to keep things in sync.
option 3: the middle ground. this is where you put on the product manager hat. you ask yourself — what's the bare minimum we need to render the grid? just the photo. that's it. the user sees a grid of images. they tap one, they go to the full post page. no caption, no likes, no username on the grid view. so all you store is the post ID and the image URL:
post ID: 8 bytes (snowflake)
image URL: ~160 bytes
─────────────────────────────────────
per post: ~168 bytes
top_100 array: 100 × 168 = ~16,800 bytes ≈ 16 KB per document
from 100 KB down to ~16 KB. peanuts. no extra consistency headaches. no syncing likes. no additional lookups. dead simple.
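the arithmetic, spelled out with the assumed averages from above:

```python
SNOWFLAKE_ID = 8   # bytes: a 64-bit post ID
IMAGE_URL = 160    # bytes: assumed average CDN URL length

per_post = SNOWFLAKE_ID + IMAGE_URL  # 168 bytes per entry
doc_size = 100 * per_post            # 16,800 bytes, i.e. ~16 KB per document
```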
and here's the key insight — you as an engineering leader can propose product changes that simplify the system. most people think this is not possible. it absolutely is. if your idea has merit and optimizes the system, your PMs will get it. it's not product versus engineering — it's product AND engineering towards the same goal. the answer usually lies in the gray area.
why pagination is overkill here
now someone will inevitably say "shouldn't we paginate the top 100?" and at first glance it sounds reasonable. but think about how the data is stored. you have one JSON document with an array of 100 items. if you paginate with page=1&size=10, what happens? the database still reads the entire document off disk, then discards 90 entries and returns 10. that's a waste of disk I/O. you're not saving anything.
pagination makes sense when you have separate rows — like SELECT * FROM posts LIMIT 10 OFFSET 20. here, you have one document containing an array. pagination is just unnecessary overhead.
so we send all 100 post metadata (ID + image URL) to the frontend in one shot. but wait — what if the user never scrolls past the first 18 images (roughly 3 folds on a phone)? then we just wasted bandwidth loading 82 images for nothing.
lazy loading to the rescue
the trick is to offload this responsibility to the frontend. send all 100 metadata entries, but let the frontend do lazy loading of images. only fetch the actual image file when it enters the viewport. if the user never scrolls, those 82 images are never downloaded.
┌─────────────────────────────────────────┐
│ Frontend │
│ │
│ Receives: 100 × {id, image_url} │
│ │
│ Renders: Only images in viewport │
│ (lazy loading) │
│ │
│ User sees 3 folds = ~18 images loaded │
│ Other 82 = never fetched │
└─────────────────────────────────────────┘
this is a good example of how it's not backend vs frontend — it's backend AND frontend together solving the problem. rather than over-engineering pagination on the backend, you push the intelligence to where it belongs. efficient bandwidth usage, no database overhead, great UX.
the write path: counting hashtag posts
now we've sorted out reads. let's talk writes. every time a post is published, we need to increment the total_posts count for each hashtag in that post's caption. but before we jump to the architecture, let's build up to it.
here's what already exists in your infra. you have a post service (svc is just shorthand for service — you'll see it everywhere in system design diagrams) that handles post creation. users upload a photo, write a caption, hit publish, and the post gets stored in the posts db. that's your existing system. now you're bolting the hashtag service on top of it.
so the question is — when a post gets published, how does the hashtag service know about it? the naive thought is "just call the hashtag service directly from the post service." but think about who else cares when a post is published:
- the feed service needs to add it to followers' feeds
- the notification service needs to notify followers
- the search service needs to index it in elasticsearch
- the hashtag service needs to update counts
if the post service directly calls each of these, you've tightly coupled everything. post service now needs to know about every downstream consumer. add a new consumer? modify the post service. one downstream is slow? post service slows down. one downstream is down? post service either fails or needs retry logic for each. that's a mess.
this is the classic case for event-driven architecture with kafka as the glue. the post service doesn't care who's listening. it just publishes a POST_PUBLISH event to a kafka topic and moves on. whoever is interested — feed, notifications, search, hashtag — independently consumes from that same topic. decoupled. extensible. if tomorrow you want to add a "trending" service, just add another consumer. zero changes to the post service.
┌──────────┐ ┌──────────────────────────────┐
│ │ │ Kafka │
│ Post Svc │────>│ (POST_PUBLISH topic) │
│ │ │ partitioned by post_id │
└──────────┘ └──────┬───┬───┬───┬───────────┘
│ │ │ │
┌─────────┘ │ │ └──────────┐
v v v v
┌──────────┐ ┌─────┐ ┌──────┐ ┌──────────────┐ ┌────────────┐
│ Feed │ │Notif│ │Search│ │ Hashtag Wrkr │────>│ Hashtag DB │
│ Service │ │ Svc │ │ Svc │ │ │ │ │
└──────────┘ └─────┘ └──────┘ │ extract tags │ │ INCR count │
│ update count │ └────────────┘
└──────────────┘
so kafka is not something we added for fun — it naturally falls out of the requirement that multiple services need to react to a post being published. the hashtag worker (wrkr = worker, another shorthand) is just one of many consumers on that topic.
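the decoupling is easy to see with a toy in-memory stand-in for the topic — real kafka adds partitioning, consumer groups, offsets, and durability on top, this only shows the shape:

```python
class Topic:
    """in-memory stand-in for a kafka topic, just to show the decoupling."""
    def __init__(self):
        self._consumers = []

    def subscribe(self, handler):
        self._consumers.append(handler)

    def publish(self, event):
        # the producer fires once and never learns who is listening
        for handler in self._consumers:
            handler(event)

post_publish = Topic()
handled = []
post_publish.subscribe(lambda e: handled.append(("feed", e["post_id"])))
post_publish.subscribe(lambda e: handled.append(("hashtag", e["post_id"])))
post_publish.publish({"post_id": 42, "caption": "golden hour #sunset"})
```

adding a "trending" consumer tomorrow is one more subscribe call — zero changes to the producer.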
now the hashtag worker receives the event, gets the post ID, fetches the caption from the post service (or the caption is included in the kafka event itself), and extracts hashtags from it using a simple regex. the naive implementation looks like this:
import re

def extract_hashtags(caption):
    return re.findall(r"#(\w+)", caption)  # naive: '#' followed by word chars

def process(msg):
    for tag in extract_hashtags(msg.caption):
        db.incr(tag, 1)
simple. one db call per tag. but let's estimate the scale. if instagram sees 100K posts per minute, and each post has ~8 hashtags on average, that's 800K database updates per minute just for this one use case. one db call per tag is not gonna cut it.
approach 2: in-memory batching
so we buffer. instead of writing every increment to the db immediately, we accumulate counts in an in-memory map and flush periodically.
from collections import defaultdict

buf = defaultdict(int)  # in-memory map: hashtag -> pending count

def process(msg):
    for tag in extract_hashtags(msg.caption):
        buf[tag] += 1
    if should_flush():  # time-based (> 5 minutes) or count-based (> 1000 messages)
        for tag, count in buf.items():
            db.incr(tag, count)
        buf.clear()
now instead of db.incr(tag, 1) eight hundred thousand times, you do db.incr(tag, count) where count could be hundreds or thousands. you've slashed the db calls by a massive factor.
but should you trigger the flush on time or frequency? time-based (every 5 minutes) has a risk — what if a surge of posts comes in and your map fills up so much that you run out of memory and the process crashes? frequency-based (every 1000 messages) is safer in that regard. pick based on your appetite for risk.
one small but crucial thing — write the buffer to disk instead of purely in-memory. if the worker crashes, in-memory data is gone. an embedded db like rocksdb or leveldb that supports incr operations gives you durability without much complexity.
the stop-the-world problem
here's where it gets really interesting. when you flush the buffer, you're iterating through the map and making sequential db calls. during this time, you've stopped consuming from kafka. if the flush takes 5 seconds (say 1000 tags × 5ms per call), that's 5 seconds of zero consumption. bad.
so you think — let me do this in a separate thread. but now the map is shared between two threads: the consumer thread (writing to the map) and the flusher thread (reading from it). if you try to iterate and modify the map simultaneously, you get concurrent modification exceptions. if you take a lock, you're back to stopping the world.
deep copy approach: take a lock, deep copy the map, clear the original, release the lock, then flush the copy in a thread. the critical section is now just the deep copy + clear, not the entire flush. much better. but deep copy takes time and doubles your memory.
the two-buffer swap — minimal stop the world:
this is the elegant solution. think of it like a mcdonald's coke fountain competition — you have two glasses. while you're drinking from one, you fill the other. apply this to buffers:
┌──────────────┐ ┌──────────────┐
│ Buffer A │ │ Buffer B │
│ (active - │ │ (passive - │
│ accepting │ │ being │
│ writes) │ │ flushed │
└──────────────┘ │ to DB) │
└──────────────┘
you always write to the active buffer. when it's time to flush, you swap references — ma, mp = mp, ma. that's it. three CPU instructions. your critical section is literally a variable swap. no deep copy, no doubled memory, no long lock times.
if count >= 1000:
    mu.acquire()
    ma, mp = mp, ma         # the entire critical section: one reference swap
    mu.release()
    spawn(write_to_db, mp)  # flusher thread drains the passive buffer
the flusher thread works on the passive buffer at its own pace while the consumer keeps writing to the now-active buffer. when the flush is done, swap again. this is the same pattern used in production systems handling petabytes of data — google's dataproc remote shuffle service used exactly this to avoid stopping consumption while writing to remote storage.
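fleshed out as a runnable sketch — DoubleBuffer and the method names are mine; the point is that the lock is only ever held for a dict update or a reference swap, never for the slow flush:

```python
import threading

class DoubleBuffer:
    def __init__(self, flush_fn):
        self._active = {}   # consumer writes here
        self._passive = {}  # flusher drains this
        self._lock = threading.Lock()
        self._flush_fn = flush_fn

    def incr(self, tag, n=1):
        with self._lock:  # held only for one dict update
            self._active[tag] = self._active.get(tag, 0) + n

    def flush(self):
        with self._lock:  # critical section is just the reference swap
            self._active, self._passive = self._passive, self._active
        self._flush_fn(self._passive)  # slow db writes happen outside the lock
        self._passive.clear()
```

while the flush runs, incr calls keep landing in the now-active buffer with no contention beyond the tiny per-write lock.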
approach 3: repartitioning by hashtag
one more thing. the kafka topic POST_PUBLISH is partitioned by post ID. so two posts containing #sunset could end up on two different worker machines. each does incr(sunset, 1) independently instead of one doing incr(sunset, 2). not the absolute minimum db calls.
if you want to minimize further, you write an adapter that reads from POST_PUBLISH, extracts hashtags, and writes to a new kafka topic POST_HASHTAG partitioned by hashtag. now all #sunset events land on the same consumer. your batching is maximally efficient.
┌──────────┐ ┌──────────┐ ┌───────────────┐ ┌────────────┐ ┌──────────┐
│ Post Svc │───>│ Kafka │───>│ HashTag │───>│ Kafka │───>│ Counting │──> DB
│ │ │ POST_PUB │ │ Extraction │ │ POST_HASH │ │ Servers │
│ │ │ (post_id)│ │ (fan out) │ │ (hashtag) │ │ │
└──────────┘ └──────────┘ └───────────────┘ └────────────┘ └──────────┘
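the adapter itself is small. a sketch — the record shape and partition count are illustrative, and python's hash() stands in for kafka's key partitioner (which uses murmur2 on the message key):

```python
import re

def repartition(event, num_partitions):
    """one POST_PUBLISH event fans out to one POST_HASHTAG record per tag."""
    records = []
    for tag in re.findall(r"#(\w+)", event["caption"]):
        partition = hash(tag) % num_partitions  # same tag -> same partition
        records.append((partition, {"tag": tag, "post_id": event["post_id"]}))
    return records
```

because the new topic is keyed by hashtag, every #sunset record lands on the same consumer, so its buffer accumulates the biggest possible counts before flushing.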
but — this adds another kafka topic, more infra to manage, more cost. the benefit should outweigh the operational complexity. for most cases, the two-buffer batching approach is good enough. don't over-engineer.
read path optimization: CDN
one last thing on the read side. for a given hashtag, the count doesn't change every second. the top 100 photos don't rotate every minute. this data is relatively stable. perfect candidate for a CDN. stick a CDN in front of your hashtag API, set a reasonable TTL, and most requests never even hit your backend.
┌────────┐ ┌────────┐ ┌─────────────┐ ┌────────────┐
│ Client │───>│ CDN │───>│ Hashtag API │───>│ Hashtag DB │
│ │<───│ │<───│ (origin) │<───│ │
└────────┘ └────────┘ └─────────────┘ └────────────┘
since the hashtag page has no personalization — #sunset looks the same for everyone — there's no reason not to cache it on CDN. anything on CDN, assume it's public.
key takeaways from hashtag service
- kafka as a glue binding services together. post service publishes, hashtag workers consume, feed/notification/search services also consume.
- adapter pattern for repartitioning — if you're unhappy with the partition key, write a relay agent that reads and repartitions.
- read path vs write path — separate them, optimize them independently. reads get replicas, caches, CDNs. writes get batching, buffering, partitioning.
- wear three hats — architect, product manager, engineer. a senior engineer proposes product changes that simplify the system.
part 2: newly unread message indicator
the requirement
you know that little number on the messaging icon in linkedin, twitter, instagram? that's not the total number of unread messages. it's the number of unique people from whom you received new messages since you last tapped on the icon.
so if your friend sends you 100 messages, the indicator shows 1, not 100. if three different people each send you messages, it shows 3. you tap the icon, the counter resets. next time someone messages you, it starts counting again.
sounds simple. let's see.
approach 1: count on the fly
your schema in mysql (or any relational db) looks like:
┌──────────────┐ ┌──────────────────────────┐ ┌──────────────────┐
│ users │ │ messages │ │ user_activity │
├──────────────┤ ├──────────────────────────┤ ├──────────────────┤
│ id │ │ id │ │ user_id (PK) │
│ email │ │ msg │ │ last_read_at │
│ name │ │ from │ │ │
│ │ │ to │ │ │
│ │ │ timestamp │ │ │
└──────────────┘ └──────────────────────────┘ └──────────────────┘
user_activity.last_read_at resets to current time whenever the user taps the message icon. the query to get the unread count is literally:
-- DISTINCT (not UNIQUE); `from` and `to` are reserved words, hence the backticks
SELECT COUNT(DISTINCT `from`)
FROM messages
WHERE `to` = ?           -- the current user
  AND timestamp > ?      -- their last_read_at from user_activity
that's it. what looks like a super complicated system is literally this one query. at small to medium scale, this works just fine.
but will this scale? that depends entirely on your indexing strategy. and this is where most people mess up.
the indexing deep dive
let's say you blindly create individual indexes on to and timestamp. seems reasonable right? here's how the query evaluates.
take user C who was last online at timestamp 11:11. the query fires WHERE to = C AND timestamp > 11:11.
the to index (B+ tree, ordered by to then id) gives you all messages ever sent to C. could be millions of rows across 10 years. the timestamp index gives you all messages after 11:11 — potentially the entire table if the timestamp is old enough. then the database does a set intersection of these two result sets to find the matching rows.
imagine millions of entries from the to index intersected with millions from the timestamp index. catastrophically slow.
composite index to the rescue
the fix is a composite index on (to, timestamp, from). now look at how the data is laid out:
┌──────────────────────────────────────┐
│ Composite Index (to, ts, from) │
├──────┬───────┬──────┬────────────────┤
│ to │ ts │ from │ id │
├──────┼───────┼──────┼────────────────┤
│ B │ 11:10 │ A │ 1 │
│ C │ 11:10 │ A │ 2 │
│ C │ 11:21 │ A │ 4 │
│ C │ 11:22 │ B │ 5 │
│ D │ 11:20 │ A │ 3 │
│ D │ 11:23 │ B │ 6 │
└──────┴───────┴──────┴────────────────┘
now when you fire WHERE to = C AND timestamp > 11:11:
- O(log n) — binary search to find where to = C starts
- skip — within C's entries, jump to where timestamp > 11:11
- k sequential reads — scan only the matching entries
total: O(log n + k) where k is the number of matching rows. no set intersection. no pointed lookups on the main table for the from column (because it's baked into the index). done.
now compare:
- without from in the index: O(log n + k) + k × O(log n) — for each of the k matched rows, you do a pointed lookup on the main table to get the from column for the COUNT(DISTINCT from).
- with from in the index: O(log n + k) — everything you need is right there in the index.
the overhead? just 8 extra bytes per index entry for storing from. even with 10 million messages, that's 80 MB. peanuts compared to the compute savings.
and yes, this index is hyper-optimized for this one query. that's perfectly fine. it's not an anti-pattern to optimize for your most critical query. as long as it works for your product and business, do it.
also — the order of columns in the composite index matters. if you did (timestamp, to, from) instead, the first column would match a huge range of timestamps, then you'd have to linearly scan through all of them to filter by to. completely defeats the purpose. the index should be ordered to match your query's selectivity — to first (narrows to one user), then timestamp (narrows to recent messages), then from (avoids main table lookup).
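you can watch the whole thing work with sqlite. column names sender/recipient/ts stand in for from/to/timestamp (the first two are awkward reserved words), and the rows mirror the example data above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY,"
    " sender TEXT, recipient TEXT, ts INTEGER)")
# composite, covering index: recipient narrows to one user, ts narrows to
# recent rows, sender avoids the main-table lookup for the COUNT(DISTINCT)
conn.execute("CREATE INDEX idx_unread ON messages (recipient, ts, sender)")
conn.executemany(
    "INSERT INTO messages VALUES (?,?,?,?)",
    [(1, "A", "B", 1110), (2, "A", "C", 1110), (3, "A", "D", 1120),
     (4, "A", "C", 1121), (5, "B", "C", 1122), (6, "B", "D", 1123)])
(unread,) = conn.execute(
    "SELECT COUNT(DISTINCT sender) FROM messages"
    " WHERE recipient = ? AND ts > ?", ("C", 1111)).fetchone()
# unread == 2: A and B both messaged C after C's last_read_at of 11:11
```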
approach 2: pre-computed with redis
the on-the-fly approach works at decent scale. but if you want to pre-compute, here's how.
┌─────────────┐ ┌─────────────────┐ ┌───────────────┐
│ Messaging │────>│ Kafka │────>│ Status Update │
│ Service │ │ ON_MSG_UNSENT │ │ Workers │
└─────────────┘ │ (partitioned by │ └───────┬───────┘
│ receiver_id) │ │
└─────────────────┘ │ SADD
v
┌─────────────┐ ┌──────────────────┐
│ Status API │<─────────────────────────│ Redis Cluster │
│ │ get_status: SCARD │ │
│ │────>clear_status: DEL────>│ A: {B, C, D} │
└─────────────┘ │ B: {C, D} │
└──────────────────┘
whenever a message can't be delivered in real-time (user is offline or inactive), an ON_MSG_UNSENT event hits kafka partitioned by receiver ID. workers consume it and fire SADD on redis — adding the sender to the receiver's set.
so if D sends A 10 messages, the worker fires SADD A D ten times. but since it's a set, D only appears once. the count? just SCARD A. returns 3 if A has messages from B, C, and D.
when A taps the message icon: DEL A. counter reset. done.
the read path is a single SCARD call. the clear path is a single DEL. both O(1). the write path is SADD calls from workers.
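the flow, modeled with plain python sets standing in for the redis cluster — in production these are real SADD / SCARD / DEL calls:

```python
from collections import defaultdict

class UnreadIndicator:
    def __init__(self):
        self._unread = defaultdict(set)  # receiver -> senders with new messages

    def on_msg_unsent(self, receiver, sender):
        self._unread[receiver].add(sender)   # SADD receiver sender

    def get_status(self, receiver):
        return len(self._unread[receiver])   # SCARD receiver

    def clear_status(self, receiver):
        self._unread.pop(receiver, None)     # DEL receiver
```

a hundred messages from one sender still move the counter by exactly one — the set does the deduplication for free.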
the punching bag pattern
now here's the micro-optimization. if D sends A 100 messages, the worker fires SADD A D a hundred times. but after the first one, D is already in the set. the remaining 99 are redundant operations — they don't change the data but they still consume redis resources.
at peak load, imagine 90% of your commands are redundant. your redis cluster is both read-heavy (polling for status) and write-heavy (ingesting events). every unnecessary command adds up.
the punching bag pattern is about protecting your critical component. you add an auxiliary redis replica in front. before firing SADD on the primary, you check the replica — is this member already in the set? if yes, skip. if no, write to primary.
┌────────────┐ ┌───────────────┐ ┌───────────────┐
│ Workers │───>│ Auxiliary │───>│ Redis Primary │
│ │ │ Redis │ │ Cluster │
│ check if │ │ (replica) │ │ │
│ exists │ │ │ │ A: {B,C,D} │
│ then write │ │ A: {B,C,D} │ │ │
└────────────┘ └───────────────┘ └───────────────┘
this isn't your day-zero solution. it's a day-n optimization when you observe that a huge percentage of writes are redundant. the pattern shows up everywhere — rate limiters are essentially punching bags that absorb load before it hits your core system.
two flavors of the punching bag:
- streaming buffer — batch and buffer writes before they hit the db (like our in-memory map in the hashtag service)
- check-and-set — discard redundant operations before they reach the critical component
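a check-and-set sketch. the replica here is a plain dict; with real redis the auxiliary replica is kept warm by replication, so a short window where a duplicate write slips through is fine:

```python
from collections import defaultdict

class PunchingBag:
    def __init__(self, primary):
        self.primary = primary           # stands in for the primary cluster
        self.replica = defaultdict(set)  # stands in for the auxiliary replica
        self.forwarded = 0               # writes that actually reached primary

    def sadd(self, key, member):
        if member in self.replica[key]:  # redundant -> absorbed here
            return False
        self.replica[key].add(member)    # replication would normally do this
        self.primary[key].add(member)
        self.forwarded += 1
        return True
```

at a 90% redundancy rate, nine out of ten commands die in the punching bag instead of loading the primary.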
the bigger picture
both these systems follow the same structural approach — identify the read path and the write path, then optimize them independently.
for reads: add replicas, add caches, add CDNs, pre-compute data, denormalize where it makes sense, create hyper-optimized indexes.
for writes: buffer, batch, use kafka as a glue, use the two-buffer swap to minimize stop-the-world, repartition data if needed, add punching bags to reduce redundant operations.
┌──────────────┐
READ PATH ───>│ │<─── WRITE PATH
│ Database │
- replicas │ │ - batching
- cache │ (the core) │ - buffering
- CDN │ │ - kafka
- indexes │ │ - partitioning
└──────────────┘
and the most important takeaway? don't skip steps. don't jump to the complex solution because it sounds impressive. start simple. measure. find the actual bottleneck. then optimize. most people love being at the peak of the complexity curve. but just one extra push of thinking can simplify your solution dramatically.
nothing is best. everything depends on the usecase. and the answer almost always lies in the gray area — not purely in engineering, not purely in product, but in that sweet spot where both work together.