Music Recommendation System

GitHub link: github.com/MetinUnlu/music-recommendation

Final project for the Mining Massive Datasets course at the University of Verona

This project implements and compares several recommendation approaches on a large-scale dataset. It processes over 3.6 million user-music interactions from the Million Song Dataset, combined with Spotify and Last.fm data, to deliver personalized music recommendations.

Project Overview

Music recommendation systems play a crucial role in modern streaming platforms, helping users discover new content based on their listening history and preferences. This project explores both collaborative filtering and content-based recommendation techniques, addressing common challenges such as the cold start problem, data sparsity, and scalability issues inherent in large-scale recommendation systems.

Dataset and Data Processing

The project uses the Million Song Dataset enriched with Spotify and Last.fm data:

Original Dataset: 9.7 million user-music interactions
Processed Dataset: 3.65 million interactions after normalization and filtering
Users: 692,376 unique users
Music Tracks: 28,597 unique tracks with comprehensive metadata
Features: Audio features from Spotify (danceability, energy, valence, tempo, etc.)

Data Preprocessing Strategy

Play counts were normalized per user, and the interactions were then filtered:

User-specific playcount normalization to handle varying listening patterns
Removal of zero-normalized interactions to reduce sparsity
Preservation of all unique users while significantly reducing dataset size
Creation of sparse matrices optimized for matrix factorization algorithms
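
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (user_id, track_id, playcount), the helper names normalize_and_filter and build_matrix, and the choice of per-user min-max scaling are all assumptions. Under plain min-max scaling, a user whose play counts are all equal would lose every interaction, so the sketch assigns such users a value of 1.0. That is one guess at how all unique users could be preserved.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def normalize_and_filter(df):
    """Min-max scale playcount within each user, then drop zero-valued rows."""
    grouped = df.groupby("user_id")["playcount"]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    span = hi - lo
    out = df.copy()
    # Assumption: users with constant play counts (span == 0) get 1.0
    # instead of 0/0, so that no user disappears from the dataset.
    out["norm"] = ((df["playcount"] - lo) / span.where(span > 0)).fillna(1.0)
    return out[out["norm"] > 0].reset_index(drop=True)


def build_matrix(df):
    """Build a users x tracks CSR matrix of normalized play counts."""
    users = df["user_id"].astype("category")
    tracks = df["track_id"].astype("category")
    matrix = csr_matrix(
        (
            df["norm"].to_numpy(),
            (users.cat.codes.to_numpy(), tracks.cat.codes.to_numpy()),
        ),
        shape=(len(users.cat.categories), len(tracks.cat.categories)),
    )
    return matrix, users.cat.categories, tracks.cat.categories


# Toy data standing in for the real interaction table.
raw = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2", "u3", "u3"],
        "track_id": ["t1", "t2", "t3", "t1", "t2", "t3"],
        "playcount": [10, 1, 5, 3, 2, 2],
    }
)
clean = normalize_and_filter(raw)
matrix, user_index, track_index = build_matrix(clean)
```

In the toy data, u1's least-played track scales to zero and is removed, while u2 and u3 keep all of their interactions, so the interaction count shrinks but the set of users does not.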

Methodology and Implementation

1. Collaborative Filtering Approaches

Multiple collaborative filtering techniques were implemented and compared:

Koren Neighborhood Model

Global weight optimization independent of specific users
Similarity weights learned through optimization rather than correlation
Performance: MSE of 0.117

Surprise Library Implementation

SVD (Singular Value Decomposition) method: MSE of 0.1107
kNN Baseline (Koren Neighborhood Model): MSE of 0.1208
Multiple algorithm comparison within a standardized framework

LightFM Recommender

Hybrid approach supporting both implicit and explicit feedback
Efficient matrix factorization for sparse data handling
Capability to work with all 3.65 million interactions
Integration of user and item metadata features

Implicit Library (Primary Method)

The core recommendation engine uses the Implicit library, chosen because it handles the full dataset efficiently:

Algorithm: Bayesian Personalized Ranking (BPR)
Matrix Factorization: Alternating Least Squares (ALS)
Optimization: Fast sparse matrix computations
Scalability: Handles full dataset efficiently
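
To illustrate what BPR optimizes, here is a from-scratch NumPy sketch of its stochastic gradient update. It is not the Implicit library's implementation and not the project's code: the function name train_bpr, the hyperparameters, and the toy matrix are hypothetical. For each sampled triple (user, listened item, unlistened item), the update pushes the listened item's score above the unlistened item's score.

```python
import numpy as np


def train_bpr(interactions, factors=8, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Fit user and item factors with BPR-style SGD on a dense 0/1 matrix.

    Assumes every user has at least one item they have NOT interacted with;
    otherwise the negative sampling loop below would never terminate.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = interactions.shape
    user_f = rng.normal(scale=0.1, size=(n_users, factors))
    item_f = rng.normal(scale=0.1, size=(n_items, factors))
    positives = np.argwhere(interactions > 0)  # rows of (user, item)

    for _ in range(epochs):
        for u, i in rng.permutation(positives):
            # Sample a negative item j that user u has not interacted with.
            j = rng.integers(n_items)
            while interactions[u, j] > 0:
                j = rng.integers(n_items)

            pu, qi, qj = user_f[u].copy(), item_f[i].copy(), item_f[j].copy()
            x_uij = pu @ (qi - qj)            # score margin: positive vs negative
            g = 1.0 / (1.0 + np.exp(x_uij))   # sigmoid(-x_uij), gradient weight

            # Gradient ascent on log sigmoid(x_uij) with L2 regularization.
            user_f[u] += lr * (g * (qi - qj) - reg * pu)
            item_f[i] += lr * (g * pu - reg * qi)
            item_f[j] += lr * (-g * pu - reg * qj)
    return user_f, item_f


# Toy data: users 0-1 listen to items 0-1, users 2-3 listen to items 2-3.
toy = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ]
)
user_f, item_f = train_bpr(toy)
scores = user_f @ item_f.T  # higher score = ranked earlier for that user
```

Because BPR learns a ranking rather than predicting a rating, its output is naturally evaluated with ranking metrics rather than MSE.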

2. Content-Based Filtering

A k-Nearest Neighbors cosine similarity approach was implemented:

Method: sklearn NearestNeighbors with cosine metric
Optimization: Avoided 19GB full similarity matrix by using efficient neighbor search
Features: Audio characteristics, genre tags, and temporal features
Output: Distance-based similarity recommendations
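
A minimal sketch of this neighbor search, using scikit-learn's NearestNeighbors with the cosine metric. The feature matrix here is random toy data, and its size, the helper name similar_tracks, and the value of k are assumptions made for illustration. The point is that only k neighbors per query are computed and stored, so the full track-by-track similarity matrix is never materialized.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the real feature table: 1000 tracks x 12 features.
rng = np.random.default_rng(42)
features = rng.random((1000, 12))

# Brute-force search supports the cosine metric directly.
index = NearestNeighbors(metric="cosine", algorithm="brute").fit(features)


def similar_tracks(track_idx, k=10):
    """Return the k nearest tracks as (track index, cosine distance) pairs."""
    # Ask for k + 1 neighbors because the query track matches itself
    # at distance 0 and has to be dropped from the result.
    distances, indices = index.kneighbors(
        features[track_idx:track_idx + 1], n_neighbors=k + 1
    )
    pairs = [
        (int(i), float(d))
        for i, d in zip(indices[0], distances[0])
        if i != track_idx
    ]
    return pairs[:k]


neighbors = similar_tracks(0, k=5)
```

Smaller distances mean more similar tracks, and results come back sorted from nearest to farthest.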

3. Hybrid Recommendation Strategy

A hybrid approach combines both methods, with the mix determined by how many unique songs a user has listened to:

≤3 unique songs: Content-based recommendations only (10 suggestions)
4-5 unique songs: Balanced approach (5 collaborative + 5 content-based)
>5 unique songs: Collaborative-focused (7 collaborative + 3 content-based)
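
The thresholds above can be expressed as a small dispatch function. This is a sketch under stated assumptions rather than the project's code: the names recommendation_mix and hybrid_recommend are hypothetical, a user with exactly 3 unique songs is treated as content-based only, and the two input lists are assumed to be already ranked by their respective models.

```python
def recommendation_mix(n_unique_songs):
    """Return (n_collaborative, n_content_based) for a user."""
    if n_unique_songs <= 3:
        return 0, 10  # too little history: content-based only
    if n_unique_songs <= 5:
        return 5, 5   # balanced mix
    return 7, 3       # enough history: collaborative-focused


def hybrid_recommend(n_unique_songs, collaborative, content_based):
    """Merge two ranked track lists according to the user's activity level."""
    n_cf, n_cb = recommendation_mix(n_unique_songs)
    picks = list(collaborative[:n_cf])
    added = 0
    for track in content_based:
        if added == n_cb:
            break
        # Skip tracks the collaborative model already contributed.
        if track not in picks:
            picks.append(track)
            added += 1
    return picks


# Hypothetical ranked outputs of the two recommenders.
cf_ranked = [f"cf_{n}" for n in range(10)]
cb_ranked = [f"cb_{n}" for n in range(10)]
new_user = hybrid_recommend(2, cf_ranked, cb_ranked)
active_user = hybrid_recommend(8, cf_ranked, cb_ranked)
```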

Technical Implementation Details

Model Architecture

The system implements a modular architecture with the following components:

ImplicitRecommender Class: Wrapper for implicit feedback models
ArtistRetriever Class: Efficient artist and track metadata management
Custom Data Loaders: Optimized sparse matrix creation
Evaluation Framework: Train-test split with precision@k metrics
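
For reference, precision@k on a held-out test split can be computed as below. This is a generic sketch of the metric, not the project's evaluation code; the function names and the toy inputs are hypothetical.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items found in the held-out set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def mean_precision_at_k(recommendations, held_out, k=10):
    """Average precision@k over every user that has held-out interactions."""
    users = [u for u in recommendations if held_out.get(u)]
    if not users:
        return 0.0
    total = sum(precision_at_k(recommendations[u], held_out[u], k) for u in users)
    return total / len(users)


# Toy example: ranked recommendations versus held-out test interactions.
recs = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y", "z", "w"]}
test_items = {"u1": {"b", "d", "q"}, "u2": {"x"}}
u1_precision = precision_at_k(recs["u1"], test_items["u1"], k=4)
overall = mean_precision_at_k(recs, test_items, k=4)
```

Dividing by k rather than by the number of items returned penalizes a model that produces fewer than k recommendations.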

Performance Optimization

Sparse matrix operations for memory efficiency
GPU acceleration support for model training
Batch processing for large-scale recommendations
Model serialization for production deployment

Results and Performance Analysis

Model Performance Comparison

Koren Neighborhood Model: MSE 0.117
SVD (Surprise): MSE 0.1107 (best traditional method)
kNN Baseline: MSE 0.1208
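Implicit BPR: Reported as giving the best recommendation quality with efficient processing (no MSE figure is given, since BPR optimizes item ranking rather than rating prediction)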

Real-World Recommendation Examples

The system generates recommendations by combining collaborative filtering and content-based filtering to match both user preferences and musical similarity. For example, for a user whose taste leans toward alternative/indie and electronic music, the following tracks are recommended:

Some Kinda Love by The Velvet Underground (Collaborative Filtering Based)
Age of Consent by New Order (Collaborative Filtering Based)
I'm Sleeping in a Submarine by Arcade Fire (Collaborative Filtering Based)
Childhood Remembered by Kevin Kern (Collaborative Filtering Based)
Thieves Like Us by New Order (Collaborative Filtering Based)
Kein Mitleid by Eisbrecher (Content Based)
Tostaky (Le Continent) by Noir Désir (Content Based)
Easy Love by MSTRKRFT (Content Based)
Avantasia by Avantasia (Content Based)
Mysterious Skies by ATB (Content Based)

Collaborative filtering recommendations (e.g., The Velvet Underground, New Order, Arcade Fire) are based on users with similar listening patterns, capturing shared music tastes such as alternative rock and indie genres. Content-based recommendations (e.g., Eisbrecher, Noir Désir, MSTRKRFT, ATB) are selected for their audio features, genre tags, and stylistic similarity to the user's previous favorites, ensuring diversity and relevance even for less common preferences. This hybrid approach ensures the user receives both familiar and novel tracks that align with their unique musical profile.

Cold Start Problem Mitigation

The hybrid approach effectively addresses the cold start problem:

Over 50% of users have ≤3 song interactions, requiring content-based fallback
43.34% of users listened to >3 different tracks, enabling collaborative filtering
Seamless transition between recommendation strategies based on user activity

Technical Challenges and Solutions

Scalability Challenges

Memory Constraints: Full cosine similarity matrix would require 19GB RAM
Solution: Implemented efficient k-NN with selective neighbor computation
Processing Speed: Utilized sparse matrix operations and GPU acceleration

Data Sparsity Issues

Challenge: Most users interact with very few songs
Solution: Hybrid recommendation with thresholds based on each user's number of unique songs
Result: Maintained recommendation quality across all user segments

Technologies and Libraries

Python: Core programming language
Implicit: Fast collaborative filtering algorithms
LightFM: Hybrid recommendation framework
Surprise: Traditional collaborative filtering benchmarks
Scikit-learn: k-NN and similarity computations
Pandas & NumPy: Data manipulation and numerical computing
SciPy: Sparse matrix operations
Jupyter: Interactive development and analysis

Key Insights and Learnings

Algorithm Performance Insights

Matrix Factorization: Implicit BPR outperformed traditional SVD approaches
Hybrid Approach: Essential for handling diverse user behavior patterns
Content Features: Audio features proved valuable for cold start scenarios
Normalization: User-specific normalization crucial for handling varying listening patterns

Practical Implementation Lessons

Data Quality: Preprocessing quality directly impacts recommendation accuracy
Scalability Design: Early architecture decisions critical for large-scale deployment
Evaluation Metrics: Multiple metrics needed for comprehensive performance assessment
User Segmentation: Different approaches needed for different user engagement levels

Future Enhancements

Deep Learning Integration: Neural collaborative filtering and autoencoders
Real-time Learning: Online learning algorithms for dynamic preference adaptation
Multi-modal Features: Integration of lyrics, audio spectrograms, and social signals
Contextual Recommendations: Time-aware and mood-based recommendation systems
Explainable AI: Interpretable recommendation explanations for users

Academic Contribution

This project demonstrates a comprehensive understanding of recommendation system principles and their practical implementation at scale. The comparative analysis of multiple algorithms, intelligent hybrid approach, and successful handling of real-world challenges like data sparsity and cold start problems showcase both theoretical knowledge and practical engineering skills essential in modern data science applications.
