Week 10: Diving into NLP while fighting through Kaggle

February 4, 2024

Week In Review

And... it's week 10! Phew, what a rollercoaster. It feels great to already be nearing the stage where we're starting beefier projects and learning state-of-the-art models.

It's exactly this stage that I was looking forward to when beginning my journey. Although I'm far from where I want to be, I still feel as though I have the wings to soar over the land of research and appreciate the breathtaking work being done.

NLP especially: there's just something about this subject that truly gets me excited! Without further ado, check out this week's progress.

TLDR:

  • Finished ISLR Chapter 6 and reviewed Chapters 4, 5, and 6
  • Completed the Titanic Kaggle competition
  • NLP lectures 2-6
  • Began VNDB xDeepFM project

Progress Timeline

January 29, 2024

  • 10 pages of ISLR on PCA, PCR, and PLS
  • Started on the Titanic competition: performed some basic EDA and used XGBoost to create a baseline model on minimally processed data (a rough sketch follows below). Throughout the week, I'd like to iteratively improve on this baseline to reach > 0.8 accuracy.
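
For reference, here's a minimal sketch of the kind of baseline I mean, assuming the competition's standard train.csv; the feature list and settings are illustrative, not my exact notebook:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Load the competition's training data (standard Kaggle file layout).
train = pd.read_csv("train.csv")

# Minimal processing: keep a handful of columns and encode Sex as 0/1;
# XGBoost handles the remaining NaNs (e.g. in Age) natively.
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X = train[features].copy()
X["Sex"] = (X["Sex"] == "female").astype(int)
y = train["Survived"]

# A near-default XGBoost model as the baseline, scored with 5-fold CV.
model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
print(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```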

January 30, 2024

  • 10 pages, finishing chapter 6
  • Read through a few of the top Titanic notebooks and gathered ideas on how to improve my model. Definitely realized the importance of hyperparameter tuning! (A quick sketch of what such tuning could look like follows below.)
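
Here's a hedged sketch of random-search tuning with scikit-learn's RandomizedSearchCV; the search space is made up for illustration, not taken from the notebooks I read:

```python
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Same minimal features as the baseline sketch above.
train = pd.read_csv("train.csv")
X = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]].copy()
X["Sex"] = (X["Sex"] == "female").astype(int)
y = train["Survived"]

# Illustrative search space over a few key XGBoost knobs.
param_dist = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),  # samples from [0.01, 0.31]
    "subsample": uniform(0.6, 0.4),       # samples from [0.6, 1.0]
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_dist,
    n_iter=50,          # number of sampled configurations
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```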

January 31, 2024

  • Reviewed half of ISLR chapter 4 on classification to prepare for completing the problem sets on Sunday.
  • Decided to resume the Stanford NLP course and ramp up the pace from my old plan of two lectures a week to one lecture per day, fitting a lecture into the day like any other class.
  • Finished up lecture 2, which was where I left off quite a while ago actually. Feels refreshing to get back to the good ol' theory. Felt a little ambitious and also watched lecture 3 on backprop (all review).

Besides that, I actually attempted to revisit an ambitious ML project idea for suggesting vocabulary words to language learners. I felt slightly overwhelmed: compared to Kaggle, starting a research project like this feels beyond my reach as of now. I have the tools, but piecing them together in a way that makes sense is something I still lack. In the end, I decided it would be best to get some more experience first in a structured way (i.e., reading research and doing smaller projects).

February 1, 2024

  • Finished reviewing ISLR chapter 4 on LDA, QDA, Naive Bayes, and other generalized linear models.
  • Watched NLP lecture 4 on dependency parsing. Boy, was it a great taste of classical and computational linguistics mixed together! I liked that Manning compared the older dynamic-programming and MST-based approaches with the newer neural ones. Always good to see the evolution of NLP!
  • More feature engineering on Titanic.

February 2, 2024

  • Watched NLP lecture 5 on RNNs. Cool to see Manning introduce them with such clear motivation and context behind them.
  • Applied PCA and more visualizations to the Titanic dataset. Also used KNN imputation (see the sketch after this list) and learned about the missForest algorithm.
  • Reviewed all of chapter 5 ISLR.
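
As promised, a minimal sketch of KNN imputation on the Titanic numeric columns with scikit-learn's KNNImputer (missForest itself is an R package that imputes iteratively with random forests; this snippet only covers the KNN part):

```python
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.read_csv("train.csv")
numeric = train[["Pclass", "Age", "SibSp", "Parch", "Fare"]]

# KNNImputer fills each missing value with the mean of that feature
# across the k nearest rows, measured on the non-missing features.
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)

print(numeric["Age"].isna().sum(), "missing ages before,",
      imputed["Age"].isna().sum(), "after")
```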

Today was packed, with exams coming up next week and my commitment to CLOV to balance. Nonetheless, you already know we're pulling it off!

February 3, 2024

  • Watched NLP lecture 6 on LSTMs. Words can't express how much I'm looking forward to learning about BERT and other transformer models!
  • Finished the final touches on my Titanic notebook, and I'm ready to present tomorrow.
  • Reviewed half of ISLR chapter 6. I'll finish up the rest of the review during the CLOV session.

Besides the usual, I've encountered a project idea that I'm extremely passionate about. I'm planning to implement an advanced, deep-learning recommendation system for vndb.net, the IMDb equivalent for visual novels. With a combination of wrangling real-world data dumps, reading SOTA research, and presenting results with web programming, it's going to be one hell of an exciting project! Eager to start, I've already begun looking into an overview of neural recommendation systems, including NCF, Wide & Deep, DeepFM, and more; a tiny sketch of the building block they share follows below.
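
As a teaser, here's a tiny numpy sketch of the factorization-machine pairwise interaction term that sits at the heart of FM-family models like DeepFM and xDeepFM, using Rendle's O(nk) reformulation; the dimensions and values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 8, 4                   # toy sizes
x = rng.random(n_features)             # one (dense) feature vector
V = rng.normal(size=(n_features, k))   # latent factors, one row per feature

# Naive O(n^2 * k) pairwise interactions: sum_{i<j} <v_i, v_j> * x_i * x_j
naive = sum(
    V[i] @ V[j] * x[i] * x[j]
    for i in range(n_features)
    for j in range(i + 1, n_features)
)

# Rendle's O(n * k) reformulation used inside FM-style layers:
# 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))

print(np.isclose(naive, fast))  # True: both forms compute the same term
```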

I honestly think this project will beef up my skills in a way a normal Kaggle competition can't, since I have to deliver something production-ready for end users. Paper grind incoming; stay tuned for more!

February 4, 2024

  • Reviewed the latter half of ISLR chapter 6 and took notes.
  • Worked through the ISLR chapter 4, 5, and 6 problem sets with Sai. We're ready to get back to the 10-pages-a-day pace starting tomorrow with chapter 7.
  • Brainstormed potential project ideas incorporating NLP/CV and SWE to create value. That's what CLOV is all about!
  • Began exploratory data analysis on the visual novel data dumps (a first-pass sketch follows below). With so many columns, missing values, dozens of tables, and the SQL structure to untangle, this phase will take quite some time. Working on some real-world data is just the practice I need!
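
For the curious, this is roughly the first-pass missingness check I'm running on each table. The file name, separator, and NULL marker here are placeholder assumptions, not the actual VNDB dump schema:

```python
import pandas as pd

# Placeholder path and NULL marker; the real dump has dozens of tables
# with their own schema, so treat this purely as an EDA pattern.
vn = pd.read_csv("vn.tsv", sep="\t", na_values=["\\N"])

print(vn.shape)                  # rows x columns at a glance
print(vn.dtypes.value_counts())  # quick inventory of column types

# Fraction of missing values per column, worst offenders first.
missing = vn.isna().mean().sort_values(ascending=False)
print(missing.head(10))
```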

Changelog

Now that I've started a decently sized project implementing xDeepFM for VNDB with a production-ready UI, I won't be working on Kaggle as much for the time being.

We strongly believe that from-scratch projects, where you have to gather your own data and form your own hypothesis and problem statement, are the best form of learning.

The biggest change is that we're going to be publishing entries every 2 weeks, rather than every week.

Goals for next week

  • Continue EDA on VNDB & start reading the xDeepFM paper.
  • Complete the first two assignments of CS224N
  • Complete ISLR chapters 7 and 8, including problems
  • Continue the 10-pages-of-ISLR and one-lecture-a-day pace