Graphing net churn using Redshift, DBT and Mode Analytics
When it comes to basic data analytics, there are three key things that have changed over the last few years:
- It is easier than ever to set up a data warehouse with virtually no data limits, such as Amazon’s AWS Redshift or Google’s BigQuery.
- It is easier than ever to synchronize data from SaaS products into such warehouses, with tools like Stitch and Segment.
- There are several easy-to-use, effective tools that make it trivial to query and chart the plethora of data in your warehouse in a sophisticated way, such as Mode Analytics and Looker.
At Vero, we’re big proponents of these advancements and how these tools work together. In most scenarios, the “data pipeline” outlined above is superior to tools like Mixpanel and Amplitude as it enables greater data accuracy and completeness.
In this guide, I’m going to introduce you to storing and analyzing data directly in Amazon’s Redshift using DBT and Mode Analytics. To get started, we’re going to chart the basic SaaS metrics of new, churned, upgrade and downgrade MRR.
I’ve included the actual SQL we use at Vero, and I’ve broken down how it works so you can use it, get inspired by it or build upon it.
The full repository of SQL can be found here.
You can apply the same approach I use here to chart virtually anything with this setup; it’s very powerful.
Charting upgrades, downgrades, new revenue and net churn
Here’s an example of the chart we’ll be building (note that “news” means new MRR in this chart).
What you’ll need for graphing net churn
Here’s the setup you’ll need to get accurate and complete data and to chart it beautifully:
- An AWS Redshift database to store all of your data.
- Stitch to synchronise data from Stripe (or your invoicing platform) to Redshift.
- DBT to turn the raw, unorganised data that Stitch lands in Redshift into clean database “views” you can query to create pretty charts.
- Sinter to run DBT queries daily (or more often) so that your charts are up to date.
- Mode Analytics to query the clean database views generated by DBT and Sinter to create beautiful charts.
All up, these components cost around $480 USD per month. They’re all scalable to virtually any limit, so you can rest assured they’ll scale with you as you grow.
Synchronising your data to Redshift
The SQL
To help you, we’ve pulled together a GitHub repository with all of the SQL needed to create these charts. As you become more familiar with DBT, you’ll observe that this repository is structured in line with DBT’s best practices and is a module that you should be able to import into another DBT project.
The SQL here ultimately outputs a table called revenue_movements. This table contains one row for each month and one column for each of the metrics new_revenue, churned_revenue, upgrade_revenue and downgrade_revenue.
The resulting table can easily be queried using a tool like Mode.
As mentioned above, I recommend you download the full SQL and follow along in a text editor like Sublime.
Below I’ll step through each of the SQL queries used to build up the revenue_movements view.
Getting a nice clean list of all invoices
All of the SQL we’ll be writing here is formatted in line with the structure outlined by DBT. Every SQL query is a SELECT and results in a table or view. This table or view can then be used by other DBT queries. This “stacking” helps create clean and reusable SQL.
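To make the stacking concrete, here’s a trivial, hypothetical model that builds on the stripe_invoices_xf model we’ll create below. DBT’s ref function resolves to the referenced model’s table or view and tells DBT to build this model after its dependency:
-- models/example/paying_customers.sql (hypothetical model, for illustration only)
select
  customer,
  total
from
  -- ref() points at another DBT model rather than a raw table,
  -- so DBT knows to build stripe_invoices_xf first
  {{ ref('stripe_invoices_xf') }}
where
  paid = true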
Onto the SQL! The first step in the process of creating our chart is to generate a view called stripe_invoices that has a row for every single customer invoice.
When using DBT, I like to follow a standard structure for my SQL code inside the models folder:
- I have a folder named after the source of the data, e.g. stripe (see the folder sketch after this list).
- Within that folder I always have a base folder. The SQL files inside the base folder are responsible for querying the underlying raw dataset and returning only the columns needed by our queries later on. I will also do exclusions in these SQL files if I have any. Files in this folder are always named using the format PROVIDER_MODEL.sql, for example stripe_invoices.sql.
- I also have a transform folder. The SQL in this folder is responsible for taking the base output, e.g. stripe_invoices, and converting values into a more usable format, where useful. These files are always named with the _xf suffix, as in stripe_invoices_xf.sql.
- Critically, the base SQL should never be queried other than by its corresponding transform query. The point is that all queries from here on use the transformed models, and never directly call the base models. This gives everything a reliable structure with a lot of flexibility.
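Putting that convention together, the models folder ends up looking something like this (the exact placement of the customers models is my assumption, following the same naming pattern):
models/
  stripe/
    base/
      stripe_invoices.sql
      stripe_customers.sql
    transform/
      stripe_invoices_xf.sql
      stripe_customers_xf.sql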
Here’s the SQL from the stripe_invoices base table:
select
  id as id,
  amount_due as amount_due,
  customer as customer,
  date as date,
  period_end as period_end,
  forgiven as forgiven,
  paid as paid,
  subscription as subscription,
  total as total,
  starting_balance as starting_balance
from
  {{ var('invoices_table') }}
where
  -- Remove customer IDs of users who will mess up MRR and other figures (e.g. test accounts)
  customer not in ('cus_3Ma5IxIINayMgK')
As you can see, it’s pretty basic. The next step is to add in a few transformations.
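A quick note on the {{ var('invoices_table') }} call above: it’s a DBT variable, typically set in the project configuration, which lets you point the model at whatever schema and table Stitch loads the raw invoices into without hard-coding it. Assuming the variable is set to stripe.invoices, DBT compiles the model to plain SQL along these lines:
-- Compiled output, assuming invoices_table is set to stripe.invoices
select
  id as id,
  amount_due as amount_due,
  -- ...remaining columns as above...
  starting_balance as starting_balance
from
  stripe.invoices
where
  customer not in ('cus_3Ma5IxIINayMgK')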
In this particular example, I want to include customers’ email addresses, but this data is not available in the raw Stripe invoices table, so I need to join another table to get it. Similarly, I want to include an index of the invoice number for each customer (i.e. first invoice, last invoice, etc.), so I’ll add a query for this.
Here’s the transform SQL:
with
stripe_invoices_with_line_items as (
  select
    *
  from
    {{ ref('stripe_invoices_with_line_items_xf_aggregated') }}
),
usage_based_recipients as (
  select
    *
  from
    {{ ref('usage_based_recipients') }}
)
select
  stripe_invoices.id as id,
  stripe_invoices.customer as customer,
  stripe_customers_xf.email as customer_email,
  stripe_invoices.date as date,
  stripe_invoices.period_end as period_end,
  stripe_invoices.forgiven as forgiven,
  stripe_invoices.subscription as subscription,
  stripe_invoices.paid as paid,
  stripe_invoices_with_line_items.total as total,
  -- Index of each invoice counting back from the customer's most recent invoice
  row_number() over(
    partition by stripe_invoices.customer
    order by stripe_invoices.date desc
  ) as last_payment,
  -- Index of each invoice counting up from the customer's first invoice
  row_number() over(
    partition by stripe_invoices.customer
    order by stripe_invoices.date asc
  ) as first_payment
from
  {{ ref('stripe_invoices') }} stripe_invoices
left outer join
  stripe_invoices_with_line_items
on
  stripe_invoices.id = stripe_invoices_with_line_items.id
left outer join
  {{ ref('stripe_customers_xf') }} as stripe_customers_xf
on
  stripe_customers_xf.id = stripe_invoices.customer
As you can see, I’m leveraging another transformed dataset, the stripe_customers_xf table. The stripe_invoices_xf view resulting from the above query has all of the data I might want later.
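The stripe_customers_xf model isn’t reproduced in this article, but following the same base/transform convention a minimal sketch might look like this (the customers_table variable name is my assumption, mirroring invoices_table above):
-- models/stripe/base/stripe_customers.sql (sketch)
select
  id as id,
  email as email
from
  {{ var('customers_table') }}

-- models/stripe/transform/stripe_customers_xf.sql (sketch)
select
  id,
  email
from
  {{ ref('stripe_customers') }}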
Querying for new revenue per month
Now that we’ve done the hard work of creating a clean table of all of our customer invoices, it is relatively easy to query for revenue movements.
Let’s start with “new revenue per month”, as per the file mrr_new.sql:
select
  date_trunc('month', date) as month,
  sum(total) as revenue
from
  {{ref('invoices_by_customer')}}
where
  -- Find customers who had their first invoice this month
  asc_row = 1
group by
  date_trunc('month', date)
This SQL finds all of the initial invoices for each customer and groups and sums them by month (e.g. January, February, etc.). You’ll note that the SQL leverages a table called invoices_by_customer, defined in invoices_by_customer.sql:
with
invoices_by_customer as (
  select
    customer_id,
    date_trunc('month', date) as month,
    sum(total) as total
  from
    {{ ref('invoices') }}
  group by
    customer_id,
    date_trunc('month', date)
)
select
  *,
  row_number() over (partition by customer_id order by month asc) as asc_row,
  row_number() over (partition by customer_id order by month desc) as desc_row
from
  invoices_by_customer
This SQL groups the invoices by customer and month. This interim step is key, as it is possible in Stripe for customers to have two invoices in a single month. This aggregation ensures that for each month there is just one total “invoiced” amount per customer. The asc_row column thus accurately represents an index of each customer’s invoices, starting in the first month they paid and counting upwards.
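To see the indexing in action, you can run a quick sanity check against the compiled view once DBT has built it (the analytics schema and customer ID here are made-up placeholders; adjust to wherever DBT materializes your models):
-- Hypothetical sanity check on the invoices_by_customer view
select
  customer_id,
  month,
  total,
  asc_row,
  desc_row
from
  analytics.invoices_by_customer
where
  customer_id = 'cus_example123'
order by
  month asc
-- For a customer invoiced in Jan, Feb and Mar, asc_row comes back 1, 2, 3
-- and desc_row comes back 3, 2, 1.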
Querying for upgrades per month
Querying for upgrades requires comparing two months and determining the difference between those two months. The SQL is located in mrr_upgrades.sql:
with
upgrades as (
  select
    invoices.month as month,
    invoices.asc_row as asc_row,
    invoices.total as current_month_total,
    coalesce(invoices_offset_one_month.total, 0) as last_month_total,
    invoices.total - coalesce(invoices_offset_one_month.total, 0) as change
  from
    {{ref('invoices_by_customer')}} invoices
  left outer join
    {{ref('invoices_by_customer')}} invoices_offset_one_month
  on
    add_months(invoices.month, -1) = invoices_offset_one_month.month
    and
    invoices.customer_id = invoices_offset_one_month.customer_id
)
select
  month as month,
  sum(change) as revenue
from
  upgrades
where
  -- Only include customers whose current month total is more than last month (upgrades!)
  abs(upgrades.current_month_total) > upgrades.last_month_total
  and
  -- Exclude customers who didn't move plans this month
  upgrades.change <> 0 and upgrades.change is not null
  and
  -- Exclude customers' first invoices, as these are actually "new" customers, not upgrades
  upgrades.asc_row <> 1
group by
  upgrades.month
This query leverages the same invoices_by_customer table twice – once to query the total for “this month” and once to query the total for “last month”. It then calculates the difference between each pair of totals.
We also need to ensure we exclude:
- Customers that have their first invoice this month, as those are already counted in the earlier mrr_new query. They represent customers who are new and, as they didn’t have an invoice prior to their starting, there will always be a change recorded in their first month.
- Customers who have invoices with a total of 0 this month, as these are customers who have churned – we’ll query them separately.
- Customers who have a negative change, as these are customers who have downgraded since last month.
We’re left with a clean table containing the upgrades per month across our invoice history.
Bringing it all together
The files mrr_churns and mrr_downgrades provide the SQL to chart these numbers. They are similar to the queries above.
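For instance, a minimal sketch of mrr_downgrades can mirror mrr_upgrades with the comparison flipped; this is an illustration of the shape of the query, not the exact file from the repository:
-- Sketch only: mirrors mrr_upgrades, but keeps months where revenue went down
with
downgrades as (
  select
    invoices.month as month,
    invoices.asc_row as asc_row,
    invoices.total as current_month_total,
    coalesce(invoices_offset_one_month.total, 0) as last_month_total,
    invoices.total - coalesce(invoices_offset_one_month.total, 0) as change
  from
    {{ref('invoices_by_customer')}} invoices
  left outer join
    {{ref('invoices_by_customer')}} invoices_offset_one_month
  on
    add_months(invoices.month, -1) = invoices_offset_one_month.month
    and
    invoices.customer_id = invoices_offset_one_month.customer_id
)
select
  month as month,
  sum(change) as revenue
from
  downgrades
where
  -- Current month is lower than last month, but not zero (zero means churned)
  downgrades.current_month_total < downgrades.last_month_total
  and downgrades.current_month_total <> 0
  and downgrades.asc_row <> 1
group by
  downgrades.month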
With the key metrics of new revenue, churned revenue, upgrades and downgrades accounted for, we can now bring these figures together.
DBT separates models (like those above) and analysis SQL files into two distinct folders. I always put the final SQL that I will copy and paste into Mode into the analysis folder. The SQL within that folder represents each SQL query used to build our dashboards.
The file revenue/mrr_movements.sql joins the mrr_new, mrr_upgrades, mrr_downgrades and mrr_churns tables discussed above:
select
  new_revenue.month,
  new_revenue.revenue as new_revenue,
  churned_revenue.revenue as churned_revenue,
  upgrade_revenue.revenue as upgrade_revenue,
  downgrade_revenue.revenue as downgrade_revenue
from
  {{ref('mrr_new')}} new_revenue
left outer join
  {{ref('mrr_churns')}} churned_revenue
on
  new_revenue.month = churned_revenue.month
left outer join
  {{ref('mrr_upgrades')}} upgrade_revenue
on
  new_revenue.month = upgrade_revenue.month
left outer join
  {{ref('mrr_downgrades')}} downgrade_revenue
on
  new_revenue.month = downgrade_revenue.month
where
  -- This works in Redshift, not in Postgres
  new_revenue.month < date_trunc('month', dateadd(months, -1, current_date))
order by
  new_revenue.month desc
There are two important things about this query:
- Using left outer join ensures that if any of the joined tables are missing a value for a certain month, those months are still included, but with a null value.
- We exclude the current month, as it always renders incomplete and shows generally alarming figures until the month ends.
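As a concrete example of what ends up in Mode: DBT compiles, but does not materialize, files in the analysis folder, so what you paste into Mode is the compiled version of mrr_movements.sql, with each ref() resolved to a real table. Assuming DBT writes models to an analytics schema, it looks along these lines:
-- Compiled analysis SQL as pasted into Mode (the analytics schema is an assumption)
select
  new_revenue.month,
  new_revenue.revenue as new_revenue,
  churned_revenue.revenue as churned_revenue,
  upgrade_revenue.revenue as upgrade_revenue,
  downgrade_revenue.revenue as downgrade_revenue
from
  analytics.mrr_new new_revenue
left outer join
  analytics.mrr_churns churned_revenue
on
  new_revenue.month = churned_revenue.month
-- ...the upgrade and downgrade joins compile the same way...
order by
  new_revenue.month desc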
Why use SQL at all?
The thing I find most powerful about charting in this way is that the result is idempotent. This is a fancy engineering word that means “you can run it hundreds of times and get the same output”.
Thanks to Stitch, Redshift, and DBT you can reload the raw data, re-run the queries and get the exact same output every time. One of the challenges I’ve always found with Amplitude, Mixpanel, and others is that sending data via API makes it easy to load raw data twice or to have poor visibility on which data is present. I feel this approach solves that.
On top of that, the beauty of using SQL for this sort of charting is that you can:
- Fully customise your charts.
- Query across datasets and event databases, allowing you to do really sophisticated analyses.
- Check your SQL code into GitHub or similar so you have a versioned history of every query and dashboard.
We’ve found running SQL queries allows us to produce consolidated reports which actually get looked at and actioned. It’s forced us to be selective about what we query, and this may be one of the most valuable aspects of the whole thing!
Note: this script is slightly different from the one we use in production and has been pared down for this article, so if you see improvements or breaking code (though, it has been tested with our setup), please let me know and we’ll update it.