It’s frustrating to listen to a song you like, forget the words, and never hear it again. Or to enjoy your Spotify “Discover Weekly” playlist only for it to disappear the next week. I built a system to fix these issues: https://spotify.suczewski.com.
- saves all plays
- saves all “Discover Weekly” playlists
- saves your top songs & artists of the last month, 6 months, and all time, and how each changes over time
Tracking all of this information has the nice effect of serving as a personal timeline. I associate certain songs & artists with times in my life, and the chronology here is an implicit journal that provides hints to memorable life moments.
In other words, why did I listen to The Band’s “When I Paint My Masterpiece” 73 times in a row between 11pm & 5am on 10/3/2018 - 10/4/2018? Why was “Mele Kalikimaka” my Christmas song of 2018? When was I listening to new music vs. re-playing the favorites?
The app runs at https://spotify.suczewski.com. It has 3 pages that correspond to the 3 pieces of information the system tracks: plays, “Discover Weekly”, and top songs & artists.
Here’s how it’s built.
I bought suczewski.com on GoDaddy and manage DNS on Cloudflare. The A record points to this blog, which is hosted on GitLab Pages. The CNAME subdomain points to an AWS CloudFront address, behind which is an API Gateway deployment (docs).
The main challenge here was getting SSL certificates sorted out, especially for GitLab Pages. This article is great, though, and gives the critical tip that “CloudFlare doesn’t combine both PEM and root certificates in one, so we need to copy the root certificate (aka “intermediate”) CloudFlare Origin CA — RSA Root … and paste it below your certificate (PEM).”
spotify.suczewski.com requests are routed to API Gateway.
- Requests with path /api/* are proxied to AWS Lambda.
- Requests with path /static/* are proxied to AWS S3.
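The two routing rules above amount to a prefix match. Here is a rough TypeScript model of them. The real routing lives in API Gateway configuration, not in application code, and the `Backend` type and `routeRequest` name are made up for illustration.

```typescript
// Hypothetical model of the routing rules; API Gateway does this via configuration.
type Backend = "lambda" | "s3" | "none";

function routeRequest(path: string): Backend {
  if (path.startsWith("/api/")) return "lambda"; // JSON API requests
  if (path.startsWith("/static/")) return "s3"; // Vue bundles, html, images
  return "none"; // anything else is not covered by the two rules above
}
```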
The main challenge here was wrapping my head around API Gateway, which was new to me. Documentation is pretty good, but I got stuck on some details. For example, how to get API Gateway to send some Content-Types as binary (PNGs) but some as text (HTML, JS). For this, I just created 2 separate API Gateway resources, one of them /static/images, and configured them separately. There were a couple of other tricks like this where I was 90% done but struggled to finish the last 10% and settled for “ok” solutions.
Also, it’s unnecessary to proxy the static content through API Gateway. I could just serve it through CloudFront. I chose the API Gateway approach because I have a small app with very low API Gateway costs, liked the idea of managing all routes in one place, and didn’t want to think about CORS issues related to hosting static content and a JSON API on different subdomains.
The S3 bucket is nothing fancy. I put my Vue JS bundles, HTML, & images in there. Access through API Gateway means the bucket can stay private and access is granted through IAM roles.
In future work, I may add CSV / SQL exports here & abandon the read Lambda to reduce DB load and keep the system running cheaply.
I’ve used React extensively, so I wanted to give Vue a try. It’s generally similar to React. I like that it’s a bit more opinionated than React and often has a “correct” way of doing things. I don’t like that it encourages data mutability. My Vue code was more concise than my React code, which was nice. I also like that Vue encourages keeping HTML, CSS, and JS all in one file, which promotes modular thinking.
For an important project, I would pick React over Vue simply for its wider adoption & better tooling. Each framework has its technical pros and cons and it’s possible to write good and bad, mutable and immutable, concise and verbose implementations in each. Both can be good choices.
For my next fun project, I’d like to use ReasonML. I am a fan of OCaml and functional programming and this is a cool thing. I have found using Flow types unsatisfying mainly because the type safety is brittle and often doesn’t work with 3rd party libraries. Using Flow adds verbosity and overhead to codebases that outweighs its partial type safety & code-as-documentation benefits. The paradigm I generally use now in JS is to rely heavily on libraries like Lodash to take care of null checks and to access data safely. This was especially critical prior to React 16’s componentDidCatch() where a bad undefined error could break an entire application. ReasonML and its strict typing seem promising!
Will consider writing more on this later!
The read Lambda supports the following API, which the Vue app hits. There are 6 salient endpoints for fetching normalized data. The POST endpoints are actually read operations despite the HTTP verb. Using POST enables sending our ids in the request body. Request bodies are technically allowed in GET requests, but many tools don’t support them, GET URLs have length restrictions, etc. (Stack Overflow). I could rename these endpoints to make it clearer that they are reads, not writes.
- GET|/v0/plays/:user_id: get list of recently played track ids for user_id. Endpoint supports paging.
- GET|/v0/dw/:user_id: get list of “Discover Weekly” playlists for user_id. A playlist is a list of track ids.
- GET|/v0/top/:user_id: get list of top track ids and artist ids for user_id.
- POST|/v0/tracks: pass in track_ids to batch get tracks. A track has a name, artist id, album id, and other fields.
- POST endpoint for artists: pass in artist_ids to batch get artists. An artist has a name and other fields.
- POST endpoint for albums: pass in album_ids to batch get albums. An album has a name, artist ids, track ids, and other fields.
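To make the "POST as a read" idea concrete, here is a minimal TypeScript sketch of how a client might build the batch request for tracks. Only the POST verb, the /v0/tracks path, and the track_ids name come from the list above. The JSON body shape, the `BatchRequest` type, and the `buildTrackBatchRequest` helper are assumptions for illustration.

```typescript
// Hypothetical request shape; the exact body format is an assumption.
interface BatchRequest {
  method: "POST";
  path: string;
  body: string; // JSON-encoded payload
}

function buildTrackBatchRequest(trackIds: string[]): BatchRequest {
  // Plays repeat tracks often, so drop duplicate ids before sending.
  const uniqueIds = Array.from(new Set(trackIds));
  return {
    method: "POST",
    path: "/v0/tracks",
    body: JSON.stringify({ track_ids: uniqueIds }),
  };
}
```

Putting the ids in the body keeps the URL short no matter how many tracks a page needs, which is the main reason for choosing POST here.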
I wrote the server with ExpressJS. Originally, I was hosting it on AWS Elastic Beanstalk, but that was costing me ~$20 / month to run an EC2 instance constantly. That’s more expensive than my Spotify subscription! To lower costs, I migrated the Express app to a single Lambda function, which was surprisingly trivial.
I thought I would have to write (and was excited to write!) a Lambda wrapper for Express. The kind souls at AWS Labs already did this though and aws-serverless-express was great.
Migrating off EC2 / Elastic Beanstalk to Lambda had the negative effect of breaking my in-memory cache. My frontend relies on normalized batch requests for tracks, artists, and albums, which are beautifully cache-able since each track / artist / album is shared across many pages & many users of the app. Previously, I cached these in an in-memory object in my JS code, which was easy and effective. Caching made my app faster and limited the number of reads to DynamoDB. Since Lambda execution environments are short-lived and not shared between concurrent invocations, my in-memory cache does very little in a Lambda environment. This is a problem because with DynamoDB, you pay for read throughput and I don’t want to pay more.
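As a rough illustration of the kind of in-memory cache described above, here is a TypeScript sketch. It is not the original code: the `Track` shape, the `Loader` type, and `getTracksCached` are hypothetical, and the loader stands in for whatever performs the DynamoDB batch read.

```typescript
// Hypothetical track shape and loader; the real app reads from DynamoDB.
interface Track {
  id: string;
  name: string;
}
type Loader = (ids: string[]) => Promise<Track[]>;

// Module-level cache: it only lives as long as the server process does.
const trackCache = new Map<string, Track>();

async function getTracksCached(ids: string[], loadFromDb: Loader): Promise<Track[]> {
  // Collect the distinct ids that are not cached yet.
  const missing = Array.from(new Set(ids.filter((id) => !trackCache.has(id))));
  if (missing.length > 0) {
    // One batch read for everything not yet cached.
    for (const track of await loadFromDb(missing)) {
      trackCache.set(track.id, track);
    }
  }
  // Return tracks in request order, skipping ids the database did not know.
  const found: Track[] = [];
  for (const id of ids) {
    const track = trackCache.get(id);
    if (track !== undefined) found.push(track);
  }
  return found;
}
```

On a long-running server, repeat requests for popular tracks never reach the database. In a fresh process the map starts empty, so every id is a miss.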
So, right now, every page load results in a couple more reads to DynamoDB than I would like. If I just want to reduce the number of reads and am really stubborn about not paying for any more read capacity I could:
- modify my frontend to request content less aggressively
- stand up a Memcached / Redis layer in front of DynamoDB for ~$10 / month
- add caching to API Gateway. This isn’t super useful right now since batch requests have different request payloads across users since people listen to different combinations of songs, artists, etc.
- add caching to API Gateway & change request structure. Don’t use batch get endpoints and instead make clients make many more network requests so we can take advantage of API Gateway caching.
- pay for an RDS instance so I care less about read throughput
- build a different frontend flow that doesn’t use this JSON API at all. Instead only allow users to download exportable CSV / SQL files from S3. I talk a little more about this in the “DynamoDB” section. This is the option I am most likely to pursue.
The scheduled Lambdas are the mitochondria of the system. There are 3 Lambda functions: one for all plays, one for “Discover Weekly”, and one for your top songs & artists. The Lambdas use scheduled CloudWatch events to hit the Spotify API regularly to pull all data and write it to DynamoDB. These same Lambdas can also be triggered by SNS events. I do this when a new user signs up so that their data shows up in the Vue app without waiting for the next scheduled job.
The main tricky part here was tracking all plays on Spotify. Spotify has an endpoint that lets you access your 50 most recent plays. Spotify counts a “play” as listening to a song for 30 seconds or more. This means that one could conceivably listen to 50 songs in 25 minutes (50 plays × 30 seconds). If I want to track all plays, I need to hit Spotify at least every 25 minutes. I set up this Lambda to run at *:00, *:20, *:40 so I hit Spotify frequently enough and have nice, predictable timestamps.
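The arithmetic behind the 25-minute bound, plus one plausible way to merge overlapping polls, can be sketched in TypeScript. The 50-play limit and 30-second threshold come from the paragraph above. The `Play` shape and the idea of de-duplicating by play timestamp are my assumptions, not necessarily how the Lambda does it.

```typescript
// Figures from the text: 50 most recent plays, 30 seconds per counted play.
const RECENT_PLAYS_LIMIT = 50;
const MIN_SECONDS_PER_PLAY = 30;

// Longest gap between polls that cannot overflow the recent-plays window.
function maxPollIntervalMinutes(limit: number, secondsPerPlay: number): number {
  return (limit * secondsPerPlay) / 60;
}

// True when a schedule that fires every `everyMinutes` stays inside that gap.
function scheduleIsSafe(everyMinutes: number): boolean {
  return everyMinutes <= maxPollIntervalMinutes(RECENT_PLAYS_LIMIT, MIN_SECONDS_PER_PLAY);
}

// Hypothetical stored play: a track id plus an ISO-8601 timestamp.
interface Play {
  trackId: string;
  playedAt: string;
}

// Consecutive polls overlap, so keep one entry per timestamp (assumed unique per play).
function mergePlays(stored: Play[], polled: Play[]): Play[] {
  const byTime = new Map<string, Play>();
  for (const play of stored.concat(polled)) byTime.set(play.playedAt, play);
  // ISO-8601 timestamps in the same format sort correctly as plain strings.
  return Array.from(byTime.values()).sort((a, b) =>
    a.playedAt < b.playedAt ? -1 : a.playedAt > b.playedAt ? 1 : 0
  );
}
```

A 20-minute schedule passes the check with 5 minutes of slack, at the price of seeing some plays twice, which the merge step absorbs.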
This part of the system will become a bottleneck if too many people use this app. Here are 3 bottlenecks:
- Many users means more DB writes, and I don’t want to pay for more write units. Solution: just buy more write units or pay for a relational DB. Alternatively, just increase write capacity temporarily during scheduled jobs. This becomes a problem, though, if I change the Lambda schedule so that there is usually at least 1 job running.
- Many users means my Lambda takes longer. Lambdas must finish in under 15 minutes. Solution: change the structure of my Lambda so it doesn’t call the Spotify API for all users in a single function. Use a distributed queue (SQS), more Lambdas, etc. to spread work out.
- Many users means many requests to the Spotify API, and they could throttle me. Solution: ask Spotify for more request volume. Or, change the Lambda schedule so that I don’t bombard Spotify with requests for all users at *:00, *:20, *:40. Instead, spread users out so that everyone has a different offset: user 1 goes at *:00:00, *:20:00, *:40:00, user 2 goes at *:00:30, *:20:30, *:40:30, user 3 goes at *:01:00, *:21:00, *:41:00, etc.
Note: using the https://spotify.suczewski.com frontend does not result in any requests to the Spotify API aside from static assets. The frontend runs entirely off DynamoDB. Requests to the Spotify API come only through the scheduled Lambdas & SNS events on new user registration.
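The per-user offsets described above could be computed along these lines. This is only a sketch: the 30-second spacing is taken from the example offsets, and the function names and zero-based user index are made up for illustration.

```typescript
// Assumed numbers: a 20-minute cycle (from the schedule) and 30-second spacing (from the example).
const CYCLE_SECONDS = 20 * 60;
const SPACING_SECONDS = 30;

// Offset into each 20-minute cycle for the user at zero-based position `userIndex`.
function offsetSeconds(userIndex: number): number {
  // Wrap around so the offset always falls inside one cycle.
  return (userIndex * SPACING_SECONDS) % CYCLE_SECONDS;
}

// Render an offset as "*:MM:SS" for the first slot of the hour.
function formatOffset(seconds: number): string {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return "*:" + pad(Math.floor(seconds / 60)) + ":" + pad(seconds % 60);
}
```

With these numbers there are only 40 distinct slots per cycle, so the 41st user would share a slot with the first. Supporting many more users would need tighter spacing or several users per slot.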
Here are my Spotify API usage graphs. It requires ~700 requests / day to run this app for 9 users. On Mondays, I fetch everybody’s “Discover Weekly” playlist and top songs & artists, so there’s some 7-day periodicity in the API usage graphs.
I used Amazon’s hosted non-relational database (NRDB), DynamoDB. I chose this mainly for cost reasons. In my architecture, RDS would be $13 / month and Dynamo would be $3 / month.
The smallest RDS instance, a t2.micro, costs $0.018 / hour or ~$13 / month. This must always be running to support the app.
Conversely, the cost of a Dynamo table comes primarily from the read / write throughput costs. A Dynamo read unit costs $0.00013 / hour or $0.09 / month and a write unit costs $0.00065 / hour or $0.47 / month. You must have at least one read unit and one write unit per database table so a table costs $0.56 / month. I have 5 tables, so my DB costs $2.80 / month. There are additional costs for the amount of data you have stored (I don’t have a lot) and table indexes (I don’t have a lot).
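The cost comparison is simple enough to write down as a TypeScript sketch. The hourly prices are the ones quoted above. The 30-day month (720 hours) is my assumption, chosen because it reproduces the monthly figures in the text, and the function names are illustrative.

```typescript
// Hourly prices quoted in the text; a 30-day month is an assumption.
const HOURS_PER_MONTH = 24 * 30;
const READ_UNIT_PER_HOUR = 0.00013;
const WRITE_UNIT_PER_HOUR = 0.00065;
const RDS_T2_MICRO_PER_HOUR = 0.018;

// Monthly throughput cost of one table with the given provisioned units.
function monthlyTableCost(readUnits: number, writeUnits: number): number {
  const hourly = readUnits * READ_UNIT_PER_HOUR + writeUnits * WRITE_UNIT_PER_HOUR;
  return hourly * HOURS_PER_MONTH;
}

// Monthly cost of `tables` tables, each at the minimum of 1 read and 1 write unit.
function monthlyDynamoCost(tables: number): number {
  return tables * monthlyTableCost(1, 1);
}

// Monthly cost of keeping one t2.micro RDS instance running all the time.
function monthlyRdsCost(): number {
  return RDS_T2_MICRO_PER_HOUR * HOURS_PER_MONTH;
}
```

Under these assumptions a minimal table comes to about $0.56 / month, 5 tables to roughly $2.80, and the RDS instance to about $13. Storage and index charges are left out, as in the text.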
I architected my application to minimize read / write throughput. All of my tables currently only require 1 read and 1 write unit. This is why the batch get endpoints and the initial Elastic Beanstalk / EC2 in-memory cache were great: they let me keep DB costs super low. In the new Lambda architecture, I will have to consider paying for more read units, adding caching, or moving to RDS if app usage grows.
A radically different approach I could take would be to take my DB out of the production frontend line of fire and use DynamoDB data to build CSV / SQL data exports in S3 asynchronously. I would abandon the complicated JSON API “single page app” and build an app to just let users download their CSV / SQL. Users could then use CSV / SQL explorers to do what they want with their data. This is the most likely infrastructural thing I will pursue since it seems like a great way to make this accessible to more people at a low cost. I could keep the JSON API & Vue UI up for a restricted set of users. I like this because it gives folks the important data they want and the system requires less ongoing maintenance from me. I don’t intend to devote more time to this project moving forward outside of visualizations & data insights, so this could be a nice hands-off approach to keep things going.
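For a sense of what the export step might involve, here is a minimal TypeScript sketch that turns play rows into CSV text. The column names, the `PlayRow` shape, and both function names are assumptions. Writing the result to S3 is left out.

```typescript
// Hypothetical export row; the real tables likely hold more fields.
interface PlayRow {
  playedAt: string;
  track: string;
  artist: string;
}

function csvField(value: string): string {
  // Quote fields containing commas, quotes, or line breaks; double any embedded quotes.
  if (/[",\n\r]/.test(value)) return '"' + value.replace(/"/g, '""') + '"';
  return value;
}

function playsToCsv(rows: PlayRow[]): string {
  const header = "played_at,track,artist";
  const lines = rows.map((r) => [r.playedAt, r.track, r.artist].map(csvField).join(","));
  return [header].concat(lines).join("\n");
}
```

Because the file would be built asynchronously and served from S3, downloads would cost no DynamoDB read capacity at request time.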
I wanted static configuration to manage my infrastructure. Declarative infrastructure means that in a few months / years if I revisit this project, I don’t have to remember which sequence of actions I took in the AWS Console to build my house of cards. I’ve used AWS’s CloudFormation before, but wanted to give Terraform a try for this project. I’m a fan!
I’ve built the scaffolding and have API Gateway / Lambda / Custom Domain infrastructure set up. I also have a code build / deployment flow set up. I store build artifacts in S3 and deploy them with Terraform. Here’s a working Terraform-managed endpoint: https://api.suczewski.com/v0/hello-world. I’m working on getting the rest of the infrastructure in this article into Terraform too.
I’ve been running this system for 6 months and it has been fun! The next frontiers for continued development are:
- insights: learn! build more sophisticated data exploration / querying / export tools. Now is the fun part - what can we discover?
- visualization: build cool visualizations around all the data I’m tracking - song, artist, & album listening history. Infinite possibilities. A few in mind!
- open source: clean the code up & open source it
- scale: consider super-scalable solution. Do not hit DynamoDB from frontend. Only support S3 data exports.
- scale: scale the system so that it can support O(1000) users 1) with infrastructure costs at ~$10 / month and 2) without Spotify throttling me
- bugs: fix empty thumbnail images, occasionally empty rows
In all likelihood, I will move on to other projects and just do the first two at some point later: ad-hoc querying & visualization as data continues to flow in. The insights, visualizations, and good memories are why I started the project, so I will likely spend more time on that, the fun part! Expect an article with cool visualizations and findings at some point. I may also pursue the super-scalable, S3-only solution.