Tautik Agrahari

geohash vs quadtree vs r-tree - three ways to find what's near you

every time your phone shows "5 ubers within a mile" or yelp pulls up restaurants near your dot on the map, somewhere in the backend a system is solving the same problem: given a point on earth, find all the other points close to it, fast.

the naive answer is to scan every row in your database, compute the haversine distance to the query point, and return everything within some radius. this works when you have 100 points. it absolutely melts when you have 100 million.

so we use spatial indexes. three come up over and over: geohash, quadtree, and r-tree. they all partition space, but they do it in fundamentally different ways — geohash encodes points into sortable strings, quadtree carves space into rigid quadrants, r-tree wraps data in flexible overlapping rectangles. let's walk through each, end with a concrete yelp example, and figure out which one to reach for when.

the problem they solve

okay. concrete. you're building yelp. user opens the app at lat 40.7128, lon -74.0060 (manhattan). they want "coffee shops within 1 mile". you have 10 million coffee shops in your db.

a brute-force SELECT * FROM shops WHERE distance(lat, lon, 40.7128, -74.0060) < 1 does 10 million distance computations per query. dead. you'd burn through a server cluster before lunch.

what you actually want is: narrow the search to ~50 candidate shops in O(log n), then run the precise distance check on just those. that "narrowing" step is what spatial indexes do.

geohash, quadtree, and r-tree are three different bets on how to do the narrowing.

geohash — encode space into a string

geohash converts a (lat, lon) pair into a short base32 string like dr5ru7p. the magic is the prefix property: two points that are close on earth will, most of the time, share a long common prefix.

so dr5ru7p and dr5ru3z are nearby (both share dr5ru). dr5ru7p and 9q8yzn are very far (no shared prefix — different sides of the country).

how it actually encodes

geohash interleaves the binary representations of latitude and longitude. you take lat (range -90..90), repeatedly split it in half, write down whether your target is in the lower or upper half. same for lon (range -180..180). interleave the bits, encode in base32. each character adds about 5 bits of precision.

geohash encoding — lat/lon get recursively halved and bit-interleaved, then base32-encoded into a short string; longer strings = more precision; nearby points share a prefix
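the halving-and-interleaving loop is short enough to write out. a from-scratch sketch using the standard geohash base32 alphabet (the times square coordinates below are my own example, not from a library):

```python
# minimal geohash encoder sketch: interleave lon/lat bits, then base32-encode.
# standard geohash alphabet — base32 with a, i, l, o removed.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, length=5):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    # even bit positions come from longitude, odd positions from latitude
    while len(bits) < length * 5:
        rng, val = (lon_range, lon) if len(bits) % 2 == 0 else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid   # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid   # keep the lower half
    # every 5 bits becomes one base32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

geohash_encode(40.7580, -73.9855, length=5)   # times square area → "dr5ru"
```

note how precision is just string length: re-encode the same point with length=7 and you get a longer string that starts with the same 5 characters.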

precision by length:

length cell size (approx)
4 chars ~39 km × 19 km
5 chars ~5 km × 5 km
6 chars ~1.2 km × 0.6 km
7 chars ~150 m × 150 m
8 chars ~38 m × 19 m

so for "coffee shops within ~1 mile", a 5-character geohash bucket is about right. you store every shop pre-computed with its 5-char geohash, then the query becomes "give me all shops where geohash = 'dr5ru'."

why redis loves this

geohashes are sorted strings. they fit perfectly into a redis sorted set, or any b-tree index. redis actually has built-in commands for this — under the hood it's just a sorted set of geohash-encoded scores.

GEOADD shops -74.0060 40.7128 "blue-bottle"
GEOADD shops -74.0050 40.7130 "joe-coffee"

GEOSEARCH shops FROMLONLAT -74.0060 40.7128 BYRADIUS 1 mi

dead simple. no postgis, no kd-tree, no custom service. just a sorted set with magic strings.
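outside redis, the same trick works on anything sorted: all entries with a given prefix form one contiguous range, which is exactly what a b-tree is good at. a sketch with python's bisect standing in for the index:

```python
import bisect

# sorted geohash strings stand in for a b-tree index on the geohash column
hashes = sorted(["dr5ru7p", "dr5ru3z", "dr5rsbk", "9q8yzn0"])

def prefix_range(sorted_hashes, prefix):
    # every string starting with `prefix` sorts between `prefix` and
    # `prefix + '\xff'` (safe here: geohash base32 chars all sort below '\xff')
    lo = bisect.bisect_left(sorted_hashes, prefix)
    hi = bisect.bisect_right(sorted_hashes, prefix + "\xff")
    return sorted_hashes[lo:hi]

prefix_range(hashes, "dr5ru")   # → ['dr5ru3z', 'dr5ru7p']
```

two binary searches, one slice. that's the whole reason "points as sortable strings" is such a good bet.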

the edge case nobody mentions on day 1

geohash has one big gotcha: points right next to each other can fall in different cells.

imagine the cell boundary runs right through times square. a hotdog stand 5 meters east of the boundary has geohash dr5ru7. a coffee shop 5 meters west has geohash dr5rgz. they're 10 meters apart, but their geohashes diverge at the fifth character — an exact match on the 5-char prefix puts them in different buckets. (at unlucky boundaries like the equator or the prime meridian, even the first character can differ.)

geohash boundary problem — two nearby points on opposite sides of a cell boundary get totally different prefixes; fix is to query the user's cell + its 8 neighbors

the fix is mandatory: always query the user's cell AND its 8 neighboring cells. there are well-known formulas to compute the 8 neighbors of any geohash, so this is cheap. you end up querying 9 cells instead of 1, then filtering the result with exact haversine distance.

so the real query loop is:

  1. compute the user's geohash at the right precision
  2. compute its 8 neighbors → 9 cells total
  3. fetch all points whose geohash is in that set
  4. compute exact haversine distance for each, filter to radius
  5. return

step 4 is the unavoidable "refinement" step. the index narrows you from 10 million to ~50–200 candidates. the haversine check filters those to the final answer.
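the loop above fits in a few lines. `encode` and `neighbors` are assumed helpers here (any geohash library provides both), and the shop store is just a dict keyed by 5-char cell:

```python
import math

EARTH_RADIUS_MI = 3959  # mean earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    # exact great-circle distance — this is the refinement step
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

def shops_near(user_lat, user_lon, radius_mi, shops_by_cell, encode, neighbors):
    cell = encode(user_lat, user_lon)                  # 1. user's cell
    cells = [cell, *neighbors(cell)]                   # 2. 9 cells total
    candidates = [s for c in cells                     # 3. coarse fetch
                  for s in shops_by_cell.get(c, [])]
    return [(name, lat, lon)                           # 4. exact filter
            for name, lat, lon in candidates
            if haversine_miles(user_lat, user_lon, lat, lon) <= radius_mi]
```

the shape to notice: the index work is steps 1–3 and never touches a trig function; haversine runs only on the candidate set.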

quadtree — recursive 2D split

quadtree comes at it differently. instead of encoding each point, it organizes space itself into a tree.

start with the whole 2D map as a single box. that's the root. when too many points are in one box (say, more than 4), split the box into 4 equal quadrants — NW, NE, SW, SE. each quadrant becomes a child node. if any of those children gets too crowded, split that one into 4 more. recurse.

quadtree structure — root is the full map, busy areas get split deeper while empty areas stay shallow; downtown nyc has lots of small leaves, the atlantic ocean stays one big block

the beauty is that it's density-adaptive: downtown manhattan ends up with tiny leaf nodes (lots of recursive splits) while the middle of nebraska might be a single huge leaf. you're not wasting tree depth on empty space, and you're not letting busy areas get overcrowded.

to find points near (lat, lon):

  1. start at the root
  2. descend into whichever quadrant contains your query point
  3. recurse until you're in a leaf
  4. that leaf contains the candidate points (and you check neighboring leaves too — same boundary problem as geohash, different mechanics)
  5. exact haversine check, return

each step picks 1 of 4 children, so search is O(log₄ n) — the same shape as O(log n), just a smaller constant factor.
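the split-on-overflow behavior is easiest to see in code. a minimal point quadtree, in plane coordinates with a made-up capacity of 4 — a toy to show the mechanics, not a production structure:

```python
CAPACITY = 4  # max points per leaf before it splits into 4 quadrants

class QuadTree:
    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)
        self.points = []        # (x, y, payload) while this node is a leaf
        self.children = None    # 4 sub-trees after a split

    def insert(self, x, y, payload=None):
        if self.children is None:
            self.points.append((x, y, payload))
            if len(self.points) > CAPACITY:
                self._split()
        else:
            self._child_for(x, y).insert(x, y, payload)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                         QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
        pts, self.points = self.points, []
        for x, y, p in pts:                     # re-distribute into quadrants
            self._child_for(x, y).insert(x, y, p)

    def _child_for(self, x, y):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def query(self, qx0, qy0, qx1, qy1):
        # all points inside the query rectangle; prunes non-overlapping boxes
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return []
        if self.children is None:
            return [p for p in self.points
                    if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1]
        return [p for c in self.children for p in c.query(qx0, qy0, qx1, qy1)]
```

dump 10 million points into this and dense regions split themselves deeper automatically — nobody picks a global precision up front.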

tradeoffs vs geohash

quadtree adapts to density: it splits deeper exactly where the data clusters, while geohash forces one global precision for the whole planet. the price is that a quadtree is an in-memory pointer structure you have to build, host, and rebuild as data changes, whereas a geohash is just a string column that any b-tree or sorted set can index.

r-tree — flexible, overlapping rectangles

both geohash and quadtree split space on fixed grid lines — every cell or quadrant has a predetermined position and size. r-trees throw that constraint out. instead, they wrap the data in flexible, overlapping bounding rectangles that adapt to wherever the data actually clusters.

think of organizing a pile of photos on a table. quadtree would force you to draw a fixed 2×2 grid and keep subdividing. r-tree just lets you draw natural rectangles around clumps of nearby photos, even if those rectangles overlap a bit.

r-tree — flexible overlapping rectangles that adapt to data; left shows MBRs on a 2d map, right shows the tree of bounding boxes

each rectangle is a minimum bounding rectangle (MBR) — the smallest axis-aligned box that contains all its children. the root has a few wide MBRs, each child has tighter ones, all the way down to leaves that hold actual data items.

why this matters

r-tree is the default in production databases. postgresql's gist indexes implement r-tree-style indexing (postgis builds on them), mysql's spatial index uses r-trees, and so do oracle spatial and basically every commercial spatial database. there's a reason for that.

the trade-off

rectangles can overlap. when a query point sits inside two overlapping rectangles, you have to descend into both branches and union the results. lots of overlap means the index does extra work.
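the descend-into-every-overlapping-branch behavior shows up even in a toy search. nodes here are hand-built (mbr, contents, is_leaf) tuples — real implementations also handle insertion and node splits:

```python
# toy r-tree search: rectangles are (xmin, ymin, xmax, ymax) tuples
def intersects(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search(node, query):
    mbr, contents, is_leaf = node
    if not intersects(mbr, query):
        return []                      # prune: query misses this whole subtree
    if is_leaf:
        return [item for rect, item in contents if intersects(rect, query)]
    # overlapping MBRs mean we may have to recurse into several children
    return [hit for child in contents for hit in search(child, query)]

# two leaves whose MBRs overlap in the middle of the map
leaf1 = ((0, 0, 10, 10), [((1, 1, 2, 2), "a"), ((8, 8, 9, 9), "b")], True)
leaf2 = ((5, 5, 15, 15), [((6, 6, 7, 7), "c"), ((14, 14, 15, 15), "d")], True)
root  = ((0, 0, 15, 15), [leaf1, leaf2], False)

search(root, (6, 6, 9, 9))   # query sits in the overlap → both leaves → ['b', 'c']
```

a query in the overlap zone visits both leaves; a query outside either MBR gets pruned at the root. minimizing how often the first case happens is exactly what the splitting heuristics fight over.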

production r-tree implementations spend a lot of cleverness on the splitting heuristic — when a node fills up and has to split, how do you divide its children into two new MBRs that minimize overlap? this is where r-tree, r+tree, r*-tree, and hilbert r-tree differ. all the same shape, different splitting strategies.

in interviews and at the whiteboard: just say "r-tree" and you're fine. the variants matter for production tuning but not for the architectural conversation.

a concrete example — yelp "coffee near me"

let's walk through the same query with both approaches.

user at (40.7128, -74.0060) asks for coffee shops within 1 mile.

geohash version

# precompute (done once per shop, stored in db / redis)
shop "Blue Bottle"    → geohash dr5ru7p
shop "Joe Coffee"     → geohash dr5ru3z
shop "Stumptown SoHo" → geohash dr5rsbk
shop "...10M others"

# query time
user_geohash = encode(40.7128, -74.0060, length=5)
neighbors    = [user_geohash,                 # the user's cell
                *neighbors_of(user_geohash)]  # plus its 8 surrounding cells

# fetch all shops whose geohash starts with one of these 9 prefixes
candidates = SELECT * FROM shops WHERE LEFT(geohash, 5) IN (...9 values...);
# ~50–200 results

# exact distance filter
results = [shop for shop in candidates
           if haversine(user, shop) <= 1.0]

simple. redis-friendly. one round trip. ship it.

quadtree version

# precompute (built in memory or stored in postgis-style index)
quadtree = build_quadtree(all_10M_shops)

# query time
user_point = (40.7128, -74.0060)
leaf       = quadtree.find_leaf(user_point)
candidates = leaf.points + leaf.neighbors_within(user_point, miles=1)

# exact distance filter
results = [c for c in candidates if haversine(user, c) <= 1.0]

same overall shape, different data structure. the quadtree gracefully handles the fact that manhattan has 100x the shop density of staten island, while geohash needed you to pick a single global precision.

both flows side-by-side — spatial index narrows candidates from 10M to ~100, then exact haversine distance filter yields the final answer

when to pick which

- uber nearby drivers → geohash: drivers move every few seconds; just update the geohash string in redis. classic pattern.
- tinder location matching → geohash: users update location periodically, sorted set, done.
- anything you want in redis quickly → geohash: it's literally a built-in.
- mapping with wildly varying density → quadtree: density-adaptive splits give better worst-case performance.
- physics sim, collision detection → quadtree: spatial reasoning over points matters more than encoding.
- yelp restaurant search at production scale → r-tree (via postgis): mixed data — points + polygons (delivery zones, neighborhoods) — and you want a real db query, not just a redis lookup.
- mapping app indexing roads + POIs together → r-tree: the only structure that gracefully handles points + polygons + linestrings in one index.
- postgis, mysql spatial, oracle spatial → r-tree: they literally use r-tree — that's the default.

rule of thumb: moving points in redis → geohash. an in-memory structure you control, with wildly uneven density → quadtree. a real database with mixed geometry → r-tree, which in practice means postgis.

the universal pattern

all three approaches share the same two-step structure:

  1. index → narrow. use the spatial structure to go from millions of candidates to a few hundred. cheap and fast.
  2. exact distance → refine. for each candidate, compute the real haversine distance. more expensive per-row, but only running on the narrowed set.

the index doesn't give you the right answer. it gives you a candidate set small enough that the exact answer is cheap to compute. forgetting step 2 is how you ship a "nearby" feature that includes shops 10 miles away just because they happened to share a geohash cell.
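the whole pattern fits in one function signature — index and refinement are just two pluggable callables. everything below is hypothetical glue to show the shape, not a real API:

```python
def nearby(query_pt, radius, narrow, exact_dist):
    # step 1: index → candidate set (cheap, approximate, may over-include)
    candidates = narrow(query_pt, radius)
    # step 2: exact distance → final answer (precise, runs on ~hundreds of rows)
    return [c for c in candidates if exact_dist(query_pt, c) <= radius]

# toy 1D demo: "index" is a coarse bucket filter, "exact" is absolute distance
points = [1.0, 2.5, 7.0, 42.0]
hits = nearby(2.0, 1.0,
              narrow=lambda q, r: [p for p in points if abs(p - q) <= 10],
              exact_dist=lambda q, p: abs(q - p))
# hits == [1.0, 2.5]
```

swap in geohash-plus-neighbors, a quadtree range query, or an r-tree search for `narrow`, and haversine for `exact_dist`, and you have every system in this post.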

what i'd remember

nothing here is novel. redis has GEOADD, postgis has ST_DWithin, mongodb has 2dsphere indexes. the design is in picking the right tool, remembering the boundary fix, and not over-engineering. you're paid to solve a problem, not to ship the fanciest architecture.