Clov Log - Roadmap

We have compiled this roadmap as a way to navigate the vast field and countless amount resources. We hope this can serve as inspiration for those of you looking to formulate your own journeys.

For serious learners, it's recommended to build goals, actionable tasks, and deadlines surrounding this (or your own) roadmap to stay on track.

Always be building projects and applying everything you learned! Participating in Kaggle competitions is a great way to do this. The biggest thing is to avoid accumulating theory without any sort of practice to cement it.

Note: This document is continuously evolving as we ourselves progress and learn more.

Stage 0 - Mathematics

Before diving into code, take a step back and build up your fundamentals. Math is the language of ML, and we must learn to understand it fluently.

Linear Algebra

MIT OCW course by Prof. Strang. This is a legendary course and you must understand all of it before moving on. A strong understanding of Linear Algebra is absolutely necessary.
To get a visual intuition: 3Blue1Brown
Stanford Review: CS229 Linear Algebra

Multi-Variable Calculus

First, review your single-variable calculus. If you haven't already taken Calculus 1 & 2, definitely go through the MIT calculus course first. For some visual Intuition, check out 3Blue1Brown

A lot of multi-variable calculus is more used for physical modeling, and less often with messy real world discrete data (which use numerical methods). Nevertheless, there are crucial aspects of multi-variable and matrix calculus that every data scientist must know to understand the inner workings of concepts like backpropagation, gradient descent, etc.

You can learn the necessary multi-variable calculus by going through Units 2 and 3 of Khan Academy Skip curvature, divergence, curl, and laplacian.

Now you're ready for the matrix calculus required for deep learning. Read up until section 5 (page 23). The rest is application of matrix calculus into neural networks, which is optimally learned during stage 3. The goal of this stage is to primarily just grok the fundamental pure math prerequisites.

Probability

Primary: Harvard STAT 110
Then, learn statistics fundamentals from StatQuest: StatQuest YouTube Playlist
Intuitive visualizations (supplement): Seeing Theory
Reviews:
- CS229 Probability Review
- Probability Stats for DS
Learn PCA: Towards Data Science

Stage 1 - Python & Data Analysis

If you aren't familiar with Python yet, go through the official Python tutorial.

To performing exploratory data analysis, feature engineering, and understanding your data all require fluency with data wrangling tools in Python. These include Pandas, Numpy, Matplotlib, and Seaborn.

Luckily, the inventor of Pandas has a free online book covering everything you need to learn. The final chapter is arguably the most important, where he walks you through several datasets and performs EDA.

Stage 2 - Machine Learning

Congratulations on making it to this stage! You're now ready to dig into some hands-on machine learning. Andrew Ng's ML specialization is arguably the best course to get started with.

I consider it as a tour of many ML methods like regression, trees, clustering, neural networks, and more, while still providing the basic math knowledge behind them.

As it is merely a starting point, somewhere along the line you'd need to pick up Introduction to Statistical Learning to fill in the rest of the math that the course skipped. Feel free to start this anytime you'd like, but we're thinking of going through it after gaining enough practice and experience using it to solve real-world problems.

Stage 3 - Deep Learning

Next up, it's time to dive deep into deep learning!

I personally found 3b1b's neural network playlist to be a great starting place. His visual backpropagation and gradient descent explanation was also very understandable.

To go deeper and grokk more theory, go through Andrew Ng's Deep Learning Specialization.

Simultaneously, go through the Fast AI course which really equips you with practical knowledge for making your own projects. He also teaches with pytorch, which is recommended to be familiar with as most research papers utilize it.

Practice Reminder

Remember to always supplement your studies with practice! For instance, after learning the basics of neural networks, you can try to create your own neural network mini-library. This is also a great point for getting started with Kaggle.

For a wide assortment of ML project ideas, you can check out the following resources

Stage 4 - NLP (specialization)

For a broad overview of NLP, deeplearning.ai has a great resource.

By far the best course I've found on NLP is Stanford's CS244N. While you're going through CS244N, explore around with other NLP courses focused on specific topics. Here are some we found quite interesting:

Transformers: Hugging face NLP course
Andrej Karpathy: Let's Build GPT

Stage 5 and Beyond

It's extremely important to keep yourself up to date with the latest breakthroughs and tech, as this is a cutting-edge field. By this point, you should have toolkit necessary for reading research papers.

Further exploration:

Computer Graphics (e.g. OCR):
- http://cs231n.stanford.edu/
LLMs and Vector databases
- Llama Index, LangChain, Pinecone, etc.
Responsively delivering value with ML: Made With ML
- Project ideas: https://github.com/NirantK/awesome-project-ideas
Transformers and Transfer Learning
etc.. eternal learning -> Intellectual pursuits!

Data Science Roadmap