Calculate Cohort Retention Analysis with SQL

Cohort Retention Analysis is a powerful thing that most business owners need to look at.

Retention: If first-time user A goes to the store on Week 1, and returns to the store the next week, he is a returned user. If user B also goes to the store on Week 1 and does not return the next week, he’s a bounced user, that basically means you lose him as a user (despite all the marketing money you spent to lure him to your store).

Cohort Analysis is a technique to see how variables change in different groups given different starting conditions (a cohort).

Defining Requirements

In this post, we’ll show you how to build the below Cohort Retention:

Cohort viz

The above chart tells you a few things:

We cohort users by their signup month. This is the users table below.
For each cohort, we show the population (how many people) in that cohort in the second column
An activity is an action user did to the site (this could be simply view the page, or as specific as making a purchase). We denote this as table activities below.

We need the following tables and fields:

users:
- id
- timestamp

activities:
- user_id
- timestamp

Bucketing Users Into Cohort

First we bucket them into different cohort by their sign up month, and store into cohort_items

-- (user_id, cohort_month), each 
with cohort_items as (
  select
    date_trunc('month', U.timestamp)::date as cohort_month,
    id as user_id
  from users U
  order by 1, 2
)

After that, we build user_activities which

-- (user_id, month_number): user X has activity in month number X
WITH user_activities as (
  select
    A.user_id,
    MONTH_DIFF(
      date_trunc('month', A.timestamp)::date,
      C.cohort_month
    ) as month_number
  from public.activities A
  left join cohort_items C ON A.user_id = C.user_id
  group by 1, 2
)

The above would return all the pairs of (user_id, month_number) that indicates if a user is active in that month after their original signup date.

| user | month_number |
| Alex | 0            |
| Alex | 1            |
| Bob  | 0            |
| Bob  | 2            |

The MONTH_DIFF is a user-defined function that takes in 2 dates, and return the number of months between them.

Cohort Size: is simply how many users are in each group

-- (cohort_month, size)
with cohort_size as (
  select cohort_month, count(1) as num_users
  from cohort_items
  group by 1
  order by 1
),

And finally, putting them together with the below:

-- (cohort_month, month_number, cnt)
with retention_table as (
  select
    C.cohort_month,
    A.month_number,
    count(1) as num_users
  from user_activities A
  left join cohort_items C ON A.user_id = C.user_id
  group by 1, 2
)
-- our final value: (cohort_month, size, month_number, percentage)
select
  B.cohort_month,
  S.num_users as total_users,
  B.month_number,
  B.num_users::float * 100 / S.num_users as percentage
from retention_table B
left join cohort_size S ON B.cohort_month = S.cohort_month
order by 1, 3
where B.cohort_month IS NOT NULL

And you would get:

Cohort raw

With Holistics, we can automatically turn them into beautiful cohort table like so:

Cohort viz

The Full SQL

Here’s the full SQL for your reference. Note: you just need to change the table names and column names inside the first 2 WITH clauses, and the rest should work as per normal. View the full SQL