Music Recommendation System
GitHub link: github.com/MetinUnlu/music-recommendation
Final project for the Mining Massive Datasets course at the University of Verona
This project implements and compares several recommendation approaches on a large dataset. It processes more than 3.6 million user-track interactions drawn from the Million Song Dataset, enriched with Spotify and Last.fm data, to produce personalized music recommendations.
Project Overview
Music recommendation systems play a crucial role in modern streaming platforms, helping users discover new content based on their listening history and preferences. This project explores both collaborative filtering and content-based recommendation techniques, addressing common challenges such as the cold start problem, data sparsity, and scalability issues inherent in large-scale recommendation systems.
Dataset and Data Processing
The project uses the Million Song Dataset, enriched with Spotify and Last.fm data:
- Original Dataset: 9.7 million user-music interactions
- Processed Dataset: 3.65 million interactions after normalization and filtering
- Users: 692,376 unique users
- Music Tracks: 28,597 unique tracks with comprehensive metadata
- Features: Audio features from Spotify (danceability, energy, valence, tempo, etc.)
Data Preprocessing Strategy
Playcounts were normalized per user, and the result was filtered:
- User-specific playcount normalization to handle varying listening patterns
- Removal of interactions whose normalized playcount is zero, to reduce sparsity
- Preservation of all unique users while reducing the dataset from 9.7 million to 3.65 million interactions
- Creation of sparse matrices optimized for matrix factorization algorithms
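The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names `user_id`, `track_id`, and `playcount` are hypothetical, and per-user min-max scaling is assumed because the source does not state the exact normalization formula. Users whose playcounts are all equal are mapped to 1.0 here so that no user is dropped, which is also an assumption.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def normalize_playcounts(df):
    """Scale each user's playcounts to [0, 1] and drop zero-valued rows.

    Assumes min-max scaling per user (not confirmed by the source).
    """
    grouped = df.groupby("user_id")["playcount"]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    span = (hi - lo).astype(float)
    # Avoid division by zero for users whose playcounts are all equal.
    safe_span = span.where(span > 0, 1.0)
    # Assumption: constant-playcount users get 1.0 so they are kept.
    norm = np.where(span > 0, (df["playcount"] - lo) / safe_span, 1.0)
    out = df.assign(norm_playcount=norm)
    return out[out["norm_playcount"] > 0].reset_index(drop=True)


def build_user_item_matrix(df):
    """Build a CSR user-by-track matrix plus the row and column labels."""
    user_codes, user_index = pd.factorize(df["user_id"])
    item_codes, item_index = pd.factorize(df["track_id"])
    values = df["norm_playcount"].to_numpy(dtype=np.float32)
    matrix = csr_matrix(
        (values, (user_codes, item_codes)),
        shape=(len(user_index), len(item_index)),
    )
    return matrix, user_index, item_index
```

CSR storage keeps only the non-zero interactions, which is what makes a matrix of this many users and tracks fit in memory.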
Methodology and Implementation
1. Collaborative Filtering Approaches
Multiple collaborative filtering techniques were implemented and compared:
Koren Neighborhood Model
- Global weight optimization independent of specific users
- Similarity weights learned through optimization rather than correlation
- Performance: MSE of 0.117
Surprise Library Implementation
- SVD (Singular Value Decomposition) method: MSE of 0.1107
- kNN Baseline (Koren Neighborhood Model): MSE of 0.1208
- Multiple algorithm comparison within a standardized framework
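For reference, the MSE values quoted for these models follow the standard definition below. The symbols are introduced here for illustration: T is a held-out set of user-track pairs, r_{ui} is the observed value (presumably the normalized playcount), and \hat{r}_{ui} is the model's prediction. The source does not state which split was used.

```latex
\mathrm{MSE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^{2}
```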
LightFM Recommender
- Hybrid approach supporting both implicit and explicit feedback
- Efficient matrix factorization for sparse data handling
- Capability to work with all 3.65 million interactions
- Integration of user and item metadata features
Implicit Library (Primary Method)
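The core recommendation engine uses the Implicit library, chosen for its recommendation quality and its speed on the full dataset:
- Algorithm: Bayesian Personalized Ranking (BPR), a matrix factorization model trained on implicit feedback
- Matrix Factorization: Alternating Least Squares (ALS), which Implicit provides as a separate model from BPR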
- Optimization: Fast sparse matrix computations
- Scalability: Handles full dataset efficiently
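To make the BPR idea concrete, here is a small NumPy sketch of its stochastic gradient update. It is not the Implicit library's implementation, which is far more optimized; the function name `train_bpr` and all hyperparameter values are hypothetical. For each observed (user, track) pair it samples a track the user has not played and nudges the factors so the observed track scores higher. It assumes every user has at least one unplayed track.

```python
import numpy as np


def train_bpr(interactions, n_users, n_items, factors=16,
              lr=0.05, reg=0.01, epochs=30, seed=0):
    """Learn user and item factors from (user, item) pairs with BPR-style SGD."""
    rng = np.random.default_rng(seed)
    user_f = rng.normal(scale=0.1, size=(n_users, factors))
    item_f = rng.normal(scale=0.1, size=(n_items, factors))

    # Items each user has interacted with, used to sample negatives.
    seen = [set() for _ in range(n_users)]
    for u, i in interactions:
        seen[u].add(i)

    pairs = list(interactions)
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            u, i = pairs[idx]
            # Sample a negative item j the user has not interacted with.
            j = int(rng.integers(n_items))
            while j in seen[u]:
                j = int(rng.integers(n_items))

            pu = user_f[u].copy()
            qi = item_f[i].copy()
            qj = item_f[j].copy()
            x_uij = pu @ (qi - qj)
            # Gradient of ln(sigmoid(x)) with respect to x is sigmoid(-x).
            g = 1.0 / (1.0 + np.exp(x_uij))

            user_f[u] += lr * (g * (qi - qj) - reg * pu)
            item_f[i] += lr * (g * pu - reg * qi)
            item_f[j] += lr * (-g * pu - reg * qj)
    return user_f, item_f
```

Scores for a user are the dot products of that user's vector with every item vector; the highest-scoring unplayed tracks become the collaborative recommendations.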
2. Content-Based Filtering
A k-Nearest Neighbors cosine similarity approach was implemented:
- Method: sklearn NearestNeighbors with cosine metric
- Optimization: Avoided building the full 19 GB similarity matrix by searching for neighbors only for the tracks being queried
- Features: Audio characteristics, genre tags, and temporal features
- Output: Distance-based similarity recommendations
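A minimal sketch of this content-based lookup with scikit-learn's `NearestNeighbors` is shown below. The helper names `fit_track_index` and `similar_tracks` are hypothetical, and the feature matrix is assumed to be a dense array with one row per track, already scaled so that no single feature dominates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def fit_track_index(features):
    """Fit a brute-force cosine index; no full similarity matrix is stored."""
    index = NearestNeighbors(metric="cosine", algorithm="brute")
    index.fit(features)
    return index


def similar_tracks(index, features, track_row, k=10):
    """Return up to k (row, cosine_distance) pairs closest to one track."""
    # Ask for k + 1 neighbors because the query track matches itself.
    distances, indices = index.kneighbors(
        features[track_row:track_row + 1], n_neighbors=k + 1
    )
    pairs = [
        (int(i), float(d))
        for i, d in zip(indices[0], distances[0])
        if i != track_row
    ]
    return pairs[:k]
```

Distances are computed per query, so memory use grows with the number of tracks queried at once rather than with the square of the catalog size.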
3. Hybrid Recommendation Strategy
The hybrid strategy picks the mix of methods from the number of unique songs a user has listened to:
- ≤3 unique songs: Content-based recommendations only (10 suggestions)
- 4-5 unique songs: Balanced approach (5 collaborative + 5 content-based)
- >5 unique songs: Collaborative-focused (7 collaborative + 3 content-based)
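The selection rule above can be written as a small function. This is a sketch with hypothetical names, with the boundaries read as at most 3, 4 to 5, and more than 5 unique songs; the two input lists are assumed to be ranked best-first.

```python
def recommendation_mix(n_unique_songs):
    """Return (n_collaborative, n_content_based) for a user's activity level."""
    if n_unique_songs <= 3:
        return (0, 10)
    if n_unique_songs <= 5:
        return (5, 5)
    return (7, 3)


def hybrid_recommend(n_unique_songs, collaborative, content_based):
    """Merge two ranked lists according to the user's activity level."""
    n_cf, n_cb = recommendation_mix(n_unique_songs)
    picks = list(collaborative[:n_cf])
    total = n_cf + n_cb
    # Fill the remaining slots with content-based tracks, skipping duplicates.
    for track in content_based:
        if len(picks) >= total:
            break
        if track not in picks:
            picks.append(track)
    return picks
```

Filling from the content-based list until the total is reached also covers the case where the collaborative model returns fewer tracks than requested.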
Technical Implementation Details
Model Architecture
The system implements a modular architecture with the following components:
- ImplicitRecommender Class: Wrapper for implicit feedback models
- ArtistRetriever Class: Efficient artist and track metadata management
- Custom Data Loaders: Optimized sparse matrix creation
- Evaluation Framework: Train-test split with precision@k metrics
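The precision@k metric used by the evaluation framework can be sketched as below. The function names are hypothetical, and dividing by k even when fewer than k tracks are recommended is one common convention, assumed here. `heldout_by_user` stands for the test-split tracks of each user.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that appear in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def mean_precision_at_k(recs_by_user, heldout_by_user, k=10):
    """Average precision@k over users that have held-out items."""
    users = [u for u, items in heldout_by_user.items() if items]
    if not users:
        return 0.0
    total = sum(
        precision_at_k(recs_by_user.get(u, []), heldout_by_user[u], k)
        for u in users
    )
    return total / len(users)
```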
Performance Optimization
- Sparse matrix operations for memory efficiency
- GPU acceleration support for model training
- Batch processing for large-scale recommendations
- Model serialization for production deployment
Results and Performance Analysis
Model Performance Comparison
- Koren Neighborhood Model: MSE 0.117
- SVD (Surprise): MSE 0.1107 (best traditional method)
- kNN Baseline: MSE 0.1208
- Implicit BPR: No MSE reported, since it ranks tracks rather than predicting playcounts; it gave the strongest recommendation quality with efficient processing
Real-World Recommendation Examples
The system generates recommendations by combining collaborative filtering and content-based filtering to match both user preferences and musical similarity. For example, a user with a taste for alternative/indie and electronic music receives the following recommendations:
- Some Kinda Love by The Velvet Underground (Collaborative Filtering Based)
- Age of Consent by New Order (Collaborative Filtering Based)
- I'm Sleeping in a Submarine by Arcade Fire (Collaborative Filtering Based)
- Childhood Remembered by Kevin Kern (Collaborative Filtering Based)
- Thieves Like Us by New Order (Collaborative Filtering Based)
- Kein Mitleid by Eisbrecher (Content Based)
- Tostaky (Le Continent) by Noir Désir (Content Based)
- Easy Love by MSTRKRFT (Content Based)
- Avantasia by Avantasia (Content Based)
- Mysterious Skies by ATB (Content Based)
Collaborative filtering recommendations (e.g., The Velvet Underground, New Order, Arcade Fire) are based on users with similar listening patterns, capturing shared music tastes such as alternative rock and indie genres. Content-based recommendations (e.g., Eisbrecher, Noir Désir, MSTRKRFT, ATB) are selected for their audio features, genre tags, and stylistic similarity to the user's previous favorites, ensuring diversity and relevance even for less common preferences. This hybrid approach ensures the user receives both familiar and novel tracks that align with their unique musical profile.
Cold Start Problem Mitigation
The hybrid approach effectively addresses the cold start problem:
- Over 50% of users have listened to ≤3 different tracks and rely on the content-based fallback
- 43.34% of users have listened to >3 different tracks, enabling collaborative filtering
- Seamless transition between recommendation strategies based on user activity
Technical Challenges and Solutions
Scalability Challenges
- Memory Constraints: A full cosine similarity matrix would require 19 GB of RAM
- Solution: Implemented k-NN search that computes neighbors only for the tracks being queried
- Processing Speed: Utilized sparse matrix operations and GPU acceleration
Data Sparsity Issues
- Challenge: Most users interact with very few songs
- Solution: Hybrid recommendation with thresholds on each user's number of unique songs
- Result: Every user segment receives ten recommendations, including users with too few interactions for collaborative filtering
Technologies and Libraries
- Python: Core programming language
- Implicit: Fast collaborative filtering algorithms
- LightFM: Hybrid recommendation framework
- Surprise: Traditional collaborative filtering benchmarks
- Scikit-learn: k-NN and similarity computations
- Pandas & NumPy: Data manipulation and numerical computing
- SciPy: Sparse matrix operations
- Jupyter: Interactive development and analysis
Key Insights and Learnings
Algorithm Performance Insights
- Matrix Factorization: Implicit BPR outperformed traditional SVD approaches
- Hybrid Approach: Essential for handling diverse user behavior patterns
- Content Features: Audio features proved valuable for cold start scenarios
- Normalization: User-specific normalization crucial for handling varying listening patterns
Practical Implementation Lessons
- Data Quality: Preprocessing quality directly impacts recommendation accuracy
- Scalability Design: Early architecture decisions critical for large-scale deployment
- Evaluation Metrics: Multiple metrics needed for comprehensive performance assessment
- User Segmentation: Different approaches needed for different user engagement levels
Future Enhancements
- Deep Learning Integration: Neural collaborative filtering and autoencoders
- Real-time Learning: Online learning algorithms for dynamic preference adaptation
- Multi-modal Features: Integration of lyrics, audio spectrograms, and social signals
- Contextual Recommendations: Time-aware and mood-based recommendation systems
- Explainable AI: Interpretable recommendation explanations for users
Academic Contribution
This project applies recommendation system principles at scale. It compares multiple algorithms, combines them in a hybrid strategy, and handles practical problems such as data sparsity and cold start, drawing on both theoretical knowledge and the engineering skills needed in data science applications.