In this project, I analyzed customer churn data for a fictional EdTech company to identify patterns and insights on leads who did not convert. I used SQL for data preparation, familiarizing myself with subqueries and views, and deepening my understanding of basic window functions. I did a little more data transformation in Power BI, and then finally created the dashboard that focuses on the characteristics of leads who dropped out of the customer acquisition flow.
I have included a longer write up below that goes more in-depth about how I approached this project, and I have also included my insights that may help the company better understand why leads are churning.
Write up:
The first thing I did was create a rough Entity Relationship Diagram to map out the relationships between the different tables in the dataset:
This was a new experience for me, as an ERD has always been provided with the previous datasets I have used. Developing this diagram helped me understand how tables connected through the common lead_id field, and this was essential for planning and writing my SQL queries and ensuring I was pulling the right data from each table.
Data Preparation & SQL:
To ensure data accuracy, I cleaned the data and removed outliers. Specifically:
-Excluded age outliers (age < 35), and capped watched_percentage at 100% (watched_percentage refers to the demo that leads watched during the awareness stage).
-Standardized inconsistent values (e.g., combining “Cannot afford” and “Can’t afford”
I then explored the data to identify the churn stage for each lead. The author of the dataset stated: “A lead can drop out at any stage of the flow. If there is no call by the junior sales manager to the lead after a certain stage then the lead is considered as dropped at that stage.”
Therefore, in my initial analysis, I used call_done_date to try to determine the stage at which a lead churned. The assumption was that the most recent call_done_date would reflect the final stage a lead reached before either converting or dropping out. However, I noticed inconsistencies that made this approach unreliable. For example, in a few instances, a lead appeared to reach multiple stages on the same date, making it difficult to identify their actual final stage solely based on date order. I think it's more likely that the author was correct and that I need to continue sharpening my SQL skills to work through an issue like this.
Regardless, to address this, I implement a stage ranking approach, using a CASE statement to assign an order to each stage (1 for Lead, 2 for Awareness…). Then, I applied the MAX function to this ordered ranking, allowing me to determine each lead’s highest (or final) stage in the funnel, independent of call dates. This method ensured that I captured the actual final stage each lead reached, providing a potentially more accurate basis for identifying where leads churned or converted.
Here is my final query, which I then turned in to a view and imported into Power BI.
Power BI:
In Power BI, I performed additional data transformations to ensure the data was in an optimal format for analysis. For example, I created a table for age groups, so that I could visualize the relationship between age and % of leads that churned.
I then created an interactive dashboard that includes a variety of visualizations.
Key Insights:
-Overall, "Cannot Afford" is the biggest reason leads are churning (33.56%). "Wants offline classes" is the second biggest reason (31.56%).
-Overall, leads whose parents are government employees make up the highest percentage of churned leads (37.3%)
Affordability is a common factor among leads across many characteristics and lead stage.
Recommendation:
Targeting higher-income students could be an effective way to reduce churn by building a segment less likely to experience affordability concerns.
Additionally, the company could look into introducing flexible payment plans or scholarships geared toward leads who express financial concerns.
Final Thoughts:
Additionally, the company could look into introducing flexible payment plans or scholarships geared toward leads who express financial concerns.
Final Thoughts:
This project was a great learning experience. I definitely need to familiarize myself more with subqueries, as I initially used them when using the last call_done_date to determine the stage a lead churned at. Although I didn't use subqueries in my final query, it is an area of SQL that I need to dive further into in order to sharpen my skills.
I also plan to work more with this data and do an analysis on the funnel as a whole, as opposed to looking only at leads that churned.