How Global is Mariah Carey's 'All I Want for Christmas Is You'?

All files used for this project can be accessed through this GitHub link.

This visualisation was made in R and was designed for a UCL Data Visualisation Society workshop I led on creating interactive maps.

Mariah Carey is said to dominate the airwaves every holiday season - but how early does she start “defrosting” and how global is her reach during this period?

Visualisation and Brief Analysis

To investigate this, I created an interactive choropleth to map the song’s chart ranks in each country’s daily Spotify Top 200 chart from 1 Nov 2020 to 31 Dec 2020. There’s also an accompanying histogram to visualise how the distribution of chart ranks changes across this period.

Before attempting to make any interpretations from this map, I want to first recognise that this data is only representative of Spotify users’ listening patterns and should not be generalised to music listening patterns as a whole. Also, we only have data for 67 countries, so this is by no means a complete global representation of the song’s performance. Now, onto the interpretations.

Predictably, the song starts its ascent in ranks as early as November and peaks in most countries a couple of days before Christmas, then falls sharply off most charts within the next 3 days. But it sees a slight resurgence on 31 Dec, either from New Year’s celebrations or end-of-year Spotify playlists.

We see that the song performs the best in North America and Eastern Europe, entering the charts around early November and maintaining a top 5 ranking starting from early December all the way till Christmas.

Western Europe and Asia join the party slightly later, and the song never sees the same level of success in chart positions in Asia as it does in the West. Its rank hovers around the 50s, especially in non-English-speaking regions like Thailand.

For Central and South America, the song only makes an appearance in the Top 200 within the immediate vicinity of Christmas and doesn’t climb beyond the 30s.

So while the song did have a massive global presence during the 2020 holiday season, it doesn’t dominate the Spotify charts as heavily elsewhere as it does in North America and Europe.

Data

Spotify’s Top 200 Chart data was used to create the plots as it’s one of the dominant streaming platforms across the world and it provides a comparable metric since the ranks are recorded in the same way across countries. It’s also much simpler gathering data from a single platform than combining data from different charts across the world. And there’s a readily available dataset on Kaggle, which was used for this visualisation.

To locate the data on a map, I also obtained the ISO-3166 country codes from the ISO website, which I merged with our Spotify dataset. While the plotly package supports the use of country names for plotting, it’s more reliable to use ISO codes instead.

Process

I will be using the plotly package in R to plot an interactive choropleth map of the Spotify Top 200 chart ranks for Mariah Carey’s ‘All I Want for Christmas Is You’ during the holiday season.

First, I’ll load in the libraries required and the Spotify chart data.

library(tidyverse) #for data cleaning
library(plotly) #for plotting our maps
library(extrafont) #for additional font options
library(htmlwidgets) #to export our final product

#Loading in spotify chart data from kaggle
chartdata <- read.csv("charts.csv")

Let’s do some initial exploration of the data to see what we’re working with.

dim(chartdata)

## [1] 26173514        9

head(chartdata)

##                         title rank       date
## 1     Chantaje (feat. Maluma)    1 2017-01-01
## 2 Vente Pa' Ca (feat. Maluma)    2 2017-01-01
## 3  Reggaetón Lento (Bailemos)    3 2017-01-01
## 4                      Safari    4 2017-01-01
## 5                 Shaky Shaky    5 2017-01-01
## 6                 Traicionera    6 2017-01-01
##                                  artist
## 1                               Shakira
## 2                          Ricky Martin
## 3                                  CNCO
## 4 J Balvin, Pharrell Williams, BIA, Sky
## 5                          Daddy Yankee
## 6                       Sebastian Yatra
##                                                     url    region  chart
## 1 https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg Argentina top200
## 2 https://open.spotify.com/track/7DM4BPaS7uofFul3ywMe46 Argentina top200
## 3 https://open.spotify.com/track/3AEZUABDXNtecAOSC1qTfo Argentina top200
## 4 https://open.spotify.com/track/6rQSrBHf7HlZjtcMZ4S4bO Argentina top200
## 5 https://open.spotify.com/track/58IL315gMSTD37DOZPJ2hf Argentina top200
## 6 https://open.spotify.com/track/5J1c3M4EldCfNxXwrwt8mT Argentina top200
##           trend streams
## 1 SAME_POSITION  253019
## 2       MOVE_UP  223988
## 3     MOVE_DOWN  210943
## 4 SAME_POSITION  173865
## 5       MOVE_UP  153956
## 6     MOVE_DOWN  151140

Subsetting and Data Cleaning

Since there are over 26 million rows of data, I’ll filter the data before attempting to explore further. I’m looking for any observations with Mariah Carey’s ‘All I Want for Christmas Is You’ (there’s also a version by Michael Bublé) in the Top 200 Charts (there’s also a Viral 50 Chart in the dataset). I’m also filtering out any observations for the Global charts, as I am interested only in country-specific charts.

Out of the 9 columns of data, there are only 4 that I think will be relevant to the plot:

rank - this is the value I will be using to fill the choropleth
date - I’ll need this for the animation
region - to plot the data in a map
streams - extra information to show when hovering over each region

#Filtering for "All I Want for Christmas Is You" and selecting relevant columns for plot
mariah_all <- chartdata %>% 
  filter(title == "All I Want for Christmas Is You", 
         artist == "Mariah Carey", 
         chart == "top200", 
         region != "Global") %>% 
  select("rank", "date", "region", "streams")

dim(mariah_all)

## [1] 9045    4

As I’m only interested in the holiday season, I’ll need to filter the data by dates. To do this, I’ll format the ‘date’ column as a date object and see the range of dates we’re working with.

#Formatting 'date' column as date object
mariah_all$date <- as.Date(mariah_all$date, format = "%Y-%m-%d")
mariah_all <- mariah_all[order(mariah_all$date),]

#Checking date range
range(mariah_all$date)

## [1] "2017-01-01" "2021-12-31"

I’m only interested in plotting for one holiday season, so let’s take the most recent year, 2021.

#Subsetting for 2021
mariah.2021 <- mariah_all %>% filter(date >= '2021-01-01')
dim(mariah.2021)

## [1] 951   4

dim(mariah_all)

## [1] 9045    4

There seems to be something wrong with the 2021 data - there are only 951 rows while the original data spanning from 2017 to 2021 has 9045 rows. It appears that there may be some data missing in 2021.

Doing a quick scan through the data, it seems that there is a lot of missing data in December 2021.

#Data for 30 November 2021
nrow(mariah.2021[mariah.2021$date == '2021-11-30',])

## [1] 41

The data seems to make sense on 30 Nov 2021 - there are 41 countries where the song is in the Top 200 charts.

#Data for 25 Dec 2021
mariah.2021 %>% filter(date == '2021-12-25')

##   rank       date               region streams
## 1   72 2021-12-25            Argentina  112049
## 2    2 2021-12-25              Austria   59219
## 3    5 2021-12-25              Bolivia   12989
## 4   67 2021-12-25               Brazil  372590
## 5    2 2021-12-25             Bulgaria    6128
## 6    1 2021-12-25 United Arab Emirates   16799
## 7    4 2021-12-25        United States 3043472

But the song is only in the Top 200 charts in 7 countries on Christmas Day itself, and countries we’d expect to see like the UK or Canada are missing. This hints that the data for 2021 is incomplete.
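
A quicker way to spot gaps like this, rather than checking dates one at a time, is to count the distinct regions reporting the song on each date. The sketch below is only an illustration: toy is a small made-up data frame (not the real chart data), and the counting is done with base R's tapply(). A sudden drop in the count would flag days worth a closer look.

```r
# Hypothetical toy data standing in for mariah.2021 (values are made up)
toy <- data.frame(
  date = as.Date(c("2021-11-30", "2021-11-30", "2021-11-30", "2021-12-25")),
  region = c("Austria", "Brazil", "Canada", "Austria")
)

# Count how many distinct regions report the song on each date
regions_per_day <- tapply(toy$region, format(toy$date),
                          function(r) length(unique(r)))
regions_per_day
```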

Instead, let’s look at the data for 2020.

#Subsetting data for 2020
mariah.2020 <- mariah_all %>% filter(date >= "2020-01-01", date < "2021-01-01")
dim(mariah.2020)

## [1] 2368    4

nrow(mariah.2020[mariah.2020$date == "2020-12-25",])

## [1] 66

Now this makes way more sense… There are 2368 total observations for 2020 and there are 66 countries where the song broke the top 200 charts on Christmas Day in 2020.

So I’ll choose to plot for the 2020 holiday season instead of 2021. To decide the start date for the plot, I’ll check when the song starts gaining some traction.

#Quick plot to look at distribution across date
plot(table(mariah.2020$date))

From the plot, it appears that the song starts to gain traction towards the start of November, so I’ll plot the map starting from 01 November 2020 to 31 Dec 2020 to track the full rise of the song in 2020 and filter the data accordingly.

mariah <- mariah_all %>% filter(date >= "2020-11-01", date < "2021-01-01")

Now that we have the time frame for the plot, let’s check if there are any issues of incomplete data involving the regions.

setdiff(unique(chartdata$region), unique(mariah$region))

## [1] "Global"      "Andorra"     "South Korea"

It appears that Andorra and South Korea are not within our dataframe for the song’s chart ranks but appear in the complete dataset. Let’s investigate whether this is simply because the song did not break into the Top 200 at all in these countries in 2020 or if there’s some incomplete data once again.

#Filtering for South Korea data to find out more
skorea <- chartdata %>% filter(region == "South Korea") %>% select(-url, -region, -trend, -streams)

head(skorea)

##                                    title rank       date        artist  chart
## 1                                   Moon  192 2021-07-01           BTS top200
## 2                                 Butter    1 2021-07-01           BTS top200
## 3                             Next Level    2 2021-07-01         aespa top200
## 4                             Bad Habits    3 2021-07-01    Ed Sheeran top200
## 5                               Dynamite    4 2021-07-01           BTS top200
## 6 Peaches (feat. Daniel Caesar & Giveon)    5 2021-07-01 Justin Bieber top200

#Finding earliest date in South Korea dataframe
min(as.Date(skorea$date, format = "%Y-%m-%d"))

## [1] "2021-02-01"

It appears that Spotify only started publishing data for South Korea in Feb 2021.

#Filtering for Andorra data
andorra <- chartdata %>% filter(region == "Andorra") %>% select(-url, -region, -trend, -streams)

head(andorra)

##                         title rank       date
## 1                     Sirenas    1 2017-08-04
## 2                     Sirenas    1 2017-08-01
## 3    Something Just Like This    2 2017-08-01
## 4             Bella y Sensual    3 2017-08-01
## 5            Deja Que Te Bese    4 2017-08-01
## 6 Have You Ever Seen The Rain    5 2017-08-01
##                                  artist   chart
## 1                              Taburete viral50
## 2                              Taburete viral50
## 3            The Chainsmokers, Coldplay viral50
## 4 Romeo Santos, Daddy Yankee, Nicky Jam viral50
## 5          Alejandro Sanz, Marc Anthony viral50
## 6          Creedence Clearwater Revival viral50

#Finding which charts are tracked in Andorra
unique(andorra$chart)

## [1] "viral50"

As for Andorra, it seems that it only has data for the Viral 50 charts, and not the Top 200 charts.

With that, it seems that these countries are absent from the dataset for “All I Want for Christmas Is You” simply because there’s no data available. So any countries missing from the mariah dataframe are missing because Spotify did not collect (or publish) data on them. In other words, the song appeared on the Top 200 Charts at least once from Nov to Dec 2020 in every country that Spotify had data for.

Merging with Country Code Data

Next, to locate the data, I will merge the current dataset we have with the ISO-3166 codes.

#reading in country code data (iso3166)
iso <- read.csv("iso3166_29dec2022.csv")
head(iso, n = 5)

##   English.short.name          French.short.name Alpha.2.code Alpha.3.code
## 1        Afghanistan           Afghanistan (l')           AF          AFG
## 2            Albania               Albanie (l')           AL          ALB
## 3            Algeria            Alg\xe9rie (l')           DZ          DZA
## 4     American Samoa Samoa am\xe9ricaines (les)           AS          ASM
## 5            Andorra               Andorre (l')           AD          AND
##   Numeric
## 1       4
## 2       8
## 3      12
## 4      16
## 5      20

I’ll only need to keep the ‘English.short.name’ and ‘Alpha.3.code’ columns: the former for matching purposes, and the latter because it holds the ISO 3-letter country code that we need for plotting.

#Cleaning up iso3166 data
iso_clean <- iso %>% select("English.short.name", "Alpha.3.code")
colnames(iso_clean) <- c("region", "iso3")

Since I’ll be merging the two datasets by region names, I’ll first check if there are any disparities in region names between the two datasets.

setdiff(unique(mariah$region), unique(iso_clean$region))

##  [1] "Netherlands"          "United Kingdom"       "United States"       
##  [4] "Czech Republic"       "Philippines"          "United Arab Emirates"
##  [7] "Vietnam"              "Dominican Republic"   "Taiwan"              
## [10] "Russia"               "Bolivia"              "Turkey"

It appears that there are 12 regions that are differently named. I searched for the corresponding names in the ISO-3166 data manually (please let me know if there’s a more efficient method), and recoded the names in the ISO data with those from the Spotify data. I decided to use the region names from Spotify for simplicity’s sake as the names in the ISO data are the official, longer, and less commonly used versions.

The only exception where I used the region name from the ISO data is for Türkiye as this is now their official name, while the 2020 Spotify data still uses their old name.
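
On the question of a more efficient method, one possible shortcut (a sketch, not what I actually did) is to strip the bracketed suffixes from the ISO names with a regular expression before comparing, which leaves fewer names to recode by hand. The vectors iso_names and spotify_names below are small hypothetical samples rather than the full datasets.

```r
# Hypothetical samples of the two naming styles (not the full datasets)
iso_names <- c("Netherlands (the)", "Philippines (the)",
               "Bolivia (Plurinational State of)", "Viet Nam", "Czechia")
spotify_names <- c("Netherlands", "Philippines", "Bolivia",
                   "Vietnam", "Czech Republic")

# Drop a trailing bracketed suffix such as " (the)" from each ISO name
stripped <- sub(" *\\(.*\\)$", "", iso_names)

# Names that still differ and would need manual recoding
still_unmatched <- setdiff(spotify_names, stripped)
still_unmatched
```

In this sample, only Vietnam and Czech Republic would still need to be recoded by hand.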

#Recoding region names in the iso data using region names from the spotify data
iso_clean <- iso_clean %>% 
  dplyr::mutate(region = recode(region, "Netherlands(the)" = "Netherlands",
                                        "Czechia" = "Czech Republic",
                                        "Dominican Republic(the)" = "Dominican Republic",
                                        "United Kingdom of Great Britain and Northern Ireland (the)" = "United Kingdom",
                                        "United States of America (the)" = "United States",
                                        "Philippines (the)" = "Philippines", 
                                        "Russian Federation (the)" = "Russia", 
                                        "Taiwan (Province of China)" = "Taiwan", 
                                        "United Arab Emirates (the)" = "United Arab Emirates", 
                                        "Viet Nam" = "Vietnam", 
                                        "Bolivia (Plurinational State of)" = "Bolivia",
                                        "T\xfcrkiye" = "Türkiye")) #formatting of Türkiye was lost somewhere, so I'm recoding it as well

#Recoding 'Turkey' in the 2020 Spotify data
mariah <- mariah %>% dplyr::mutate(region = recode(region, "Turkey" = "Türkiye"))

Now that the region names match up, I can finally merge the two datasets to incorporate the ISO-3166 country codes into our Spotify data.

mariah_iso <- left_join(mariah, iso_clean, by = "region")
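
After a join like this, it can be worth confirming that no region was left without a country code, since an unmatched name would silently produce a missing iso3 value. Here is a minimal sketch of that check on made-up data, using base R's merge() with all.x = TRUE as a stand-in for left_join(); the charts and codes tables are hypothetical.

```r
# Hypothetical toy tables (values are made up for illustration)
charts <- data.frame(region = c("Brazil", "Vietnam", "Taiwan"),
                     rank = c(67, 40, 12))
codes <- data.frame(region = c("Brazil", "Viet Nam", "Taiwan"),
                    iso3 = c("BRA", "VNM", "TWN"))

# Keep every chart row, attaching iso3 where the region name matches
joined <- merge(charts, codes, by = "region", all.x = TRUE)

# Regions whose code is missing after the join
unmatched <- unique(joined$region[is.na(joined$iso3)])
unmatched
```

An empty result would suggest that every region found its code; here "Vietnam" is flagged because the codes table spells it "Viet Nam".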

Creating a new variable for the choropleth

With the data now cleaned and merged, let’s try to plot an initial map to see how it looks before customising further.

mariah_test <- mariah_iso

#Converting date column to character as plotly can't take in date data as an input
mariah_test$Date <- as.character(mariah_test$date)

#Creating plot
plot_geo(mariah_test,            
         frame = ~Date,       #column used for animation frames.
         locations = ~iso3,   #column used to plot locations
         z = ~rank,           #column used to fill the map in
         zmin = 1,            #sets minimum rank as 1
         zmax = 200,          #sets maximum rank as 200
         colorscale = 'Reds', #determines colour
         reversescale = T)    #reverses scale so that the colour gets more intense the lower the rank

There are two glaring issues I can spot here.

As the colour is spread out evenly from 1-200, this makes it difficult to visually spot differences in chart ranks towards the top of the charts, and it so happens that most of the ranks are clustered around the top.
We can’t visually tell the difference between a country where the song is not on the Top 200 chart and a country which we do not have Spotify data for. They are both coloured white and cannot be hovered over. This is because they’re both missing from the chart data.

hist(mariah_iso$rank)

Looking at the distribution of the chart ranks for the song, most of them are clustered around 1-20, and it would not make sense for the color scale to be evenly spread out between 1-200 as it obscures the differences near the top.

To solve this, I will group the ranks into 7 categories: 1, 2-5, 6-10, 11-20, 21-50, 51-100, 101-200, and plot using these categories instead of the actual numeric rank. I’m also adding an 8th category, labelled “<200” (that is, ranked below position 200), so that we can plot instances where the song has fallen out of the Top 200 charts.

#Renaming 'rank' to "actual.rank" as we will be creating a "Rank" variable later on
colnames(mariah_iso)[colnames(mariah_iso) == "rank"] <- "actual.rank"

#Creating breaks
mariah_iso$breaks <- cut(mariah_iso$actual.rank, breaks = c(0, 1, 5, 10, 20, 50, 100, 200))
levels(mariah_iso$breaks) <- c("1", "2-5", "6-10", "11-20", "21-50", "51-100", "101-200", "<200")
#Reversing factors. Rank 1 should have the highest value when plotting the map
mariah_iso$breaks <- fct_rev(mariah_iso$breaks)

#Checking the distribution of the categories
plot(table(mariah_iso$breaks))

We can see that there is now a more even distribution across the rank categories.
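
To make the binning itself concrete, here is a small sketch with made-up ranks. It assumes the same break points as above but passes the labels straight to cut() through its labels argument, which is a slightly different route from the one I took. Note that cut() treats each interval as closed on the right by default, so a rank of 5 falls in "2-5" and a rank of 6 in "6-10".

```r
# Made-up ranks, chosen to hit each of the 7 ranked categories once
ranks <- c(1, 5, 6, 20, 21, 100, 200)

# Same break points as above; intervals are (0,1], (1,5], (5,10], ...
binned <- cut(ranks,
              breaks = c(0, 1, 5, 10, 20, 50, 100, 200),
              labels = c("1", "2-5", "6-10", "11-20",
                         "21-50", "51-100", "101-200"))
as.character(binned)
```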

Moving onto the second issue, in order to differentiate between the two instances where

the song isn’t in the Top 200 Chart for the country, and
there isn’t chart data for the country,

I will create extra rows in our dataset such that there is a row for every combination of region and date within our dataset. For dates when a region has no data recorded, their rank is coded as “Not in the Top 200 Charts”. I will also fill in values for the actual.rank and streams columns so that these can be displayed when hovered over.

#Converting columns into characters to enable filling in of values
mariah_iso$actual.rank <- as.character(mariah_iso$actual.rank)
mariah_iso$streams <- as.character(mariah_iso$streams)

#Creating rows for missing region and missing dates so that we have a row for every combination of region and date
mariah.complete <- mariah_iso %>% 
                    complete(date, nesting(region, iso3), fill = list(actual.rank = "<i>Not in the Top 200 chart</i>", streams = "<i>Not in the Top 200 chart</i>", breaks = "<200"))
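
To illustrate what this completion step does, here is a rough base R equivalent on made-up data. It is not the tidyr code above: it uses expand.grid() to build every date and region combination and merge() to attach the observed ranks, and the obs, grid and rank_label names are hypothetical.

```r
# Hypothetical observations: Japan has no row for 24 Dec
obs <- data.frame(date = c("2020-12-24", "2020-12-25", "2020-12-25"),
                  region = c("Brazil", "Brazil", "Japan"),
                  rank = c(80, 67, 30),
                  stringsAsFactors = FALSE)

# Every combination of the dates and regions seen in the data
grid <- expand.grid(date = unique(obs$date),
                    region = unique(obs$region),
                    stringsAsFactors = FALSE)

# Attach observed ranks; combinations with no observation get NA
full <- merge(grid, obs, by = c("date", "region"), all.x = TRUE)

# Replace the gaps with a readable label for the hover text
full$rank_label <- ifelse(is.na(full$rank),
                          "Not in the Top 200 chart",
                          as.character(full$rank))
full
```

The result has one row per date and region pair, with the Japan row for 24 Dec labelled as not charting rather than being absent altogether.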

As plotly can’t seem to plot categorical variables, I will create a new column named Rank that stores the numeric values for these factors, which I will use for the plot.

#Creating a new column of the numeric values of each category. This will be used for the plot
mariah.complete$Rank <- as.numeric(mariah.complete$breaks)

#Converting date column to character as plotly can't take in date data as an input
mariah.complete$Date <- as.character(mariah.complete$date)
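
As a quick illustration of why the factor levels were reversed earlier: as.numeric() on a factor returns each value's position among the levels, so with the reversed order the best category gets the largest number. The sketch below uses base R's rev() on a hypothetical labels vector in place of fct_rev().

```r
# The eight category labels, from best rank to worst
labels <- c("1", "2-5", "6-10", "11-20", "21-50", "51-100", "101-200", "<200")

# A few example values, stored with the level order reversed
breaks <- factor(c("1", "21-50", "<200"), levels = rev(labels))

# Integer codes: position of each value among the reversed levels
codes <- as.numeric(breaks)
codes
```

Here "1" maps to 8 and "<200" maps to 1, so higher values on the colour scale correspond to better chart positions.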

Plotting the choropleth map

Now that the data is finally ready for plotting, all that’s left is to customise how the map looks. Here, I’m creating a couple of lists with arguments that customise different portions of the plot. These lists will be plugged into the plot_geo function when plotting the map.

As plotly recognises html tags, I will also be using them to format the text shown in the plot.

#Sets what information is shown when hovering over the countries
mariah.complete$hover <- with(mariah.complete,paste('<b>',region,'</b>','<br>',
                           'Chart Rank:', actual.rank, '<br>',
                           'Daily Streams:', streams))

#Customises hover layout
hover.layout <- list(
  bgcolor = "#126c50",                     
  bordercolor = "transparent",             
  font = list(size = 10,                  
              color = "white")
)

#Specifying tick positions along the color bar
tick.positions <- vector(length = 8)
for(i in 1:8){
    tick.positions[i] <- (7/8)*0.5 + (7/8)*(i-1) + 1
}

#Specify map projections and options
g <- list(
  framecolor = "#b19f98",     #Color of frame around the map
  projection = list(type = "mercator"),  #plotly projection names are lowercase
  coastlinecolor = "#b19f98",  #Color of coastlines around countries
  coastlinewidth = 0.5,        #Width of coastlines
  showland = T,                #To color the countries not inside the dataset
  landcolor = "#faf6ed",       
  showocean = T,               #To color the ocean
  oceancolor = "#faf6ed")

#Specifies the boundary characteristics of countries that are inside our dataset
l <- list(line = list(color = "#b19f98", width = 0.5))

#Customises date value layout
date.layout <- list(font = list(size = 20,
                                color = "#126c50"))
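
The tick positions in the loop above are meant to sit at the centre of each colour band. The idea is that the colour bar runs from 1 to 8 (the zmin and zmax I set in the plot) and is split into 8 equal bands, one per category. The same arithmetic can be sketched in vectorised form as below, where n_cat, z_min, z_max and band are names introduced just for this illustration.

```r
# Illustrative names only: 8 categories spread over a colour bar from 1 to 8
n_cat <- 8
z_min <- 1
z_max <- 8

# Width of one colour band
band <- (z_max - z_min) / n_cat

# Centre of each band, counted from the bottom of the colour bar
tick_positions <- z_min + band * (seq_len(n_cat) - 0.5)
tick_positions
```

This gives the same values as the loop, starting at 1.4375 and increasing in steps of 0.875.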

And finally, let’s plot the map.

#Creating the plotly map
mariah.map <- plot_geo(mariah.complete,            
         frame = ~Date,            
         locations = ~iso3,         
         z = ~Rank,
         zmin = 1,  #Specify zmin and zmax to keep the color bar constant throughout the animation
         zmax = 8,
         color = ~Rank,      
         colorscale = "Reds",
         text = ~hover,
         hoverinfo = 'text',  #To control what text is shown when hovering over countries
         showlegend = F,  #Remove legend to create more space for the color bar
         marker = l) %>%       
  
  layout(geo = g, 
         margin = list(t = 50), #Adds top margins to the plot
         paper_bgcolor = "#faf5e7",  #Sets background colour
         font = list(color = "#b42d1e",
                     family = "Roboto Condensed"),  
         title = "<b>Popularity of 'All I Want for Christmas Is You' on Spotify across the 2020 festive season</b><br><i>Source: <a href = 'https://www.kaggle.com/datasets/dhruvildave/spotify-charts?select=charts.csv'>Spotify Top 200 Charts</a></i>") %>%  
  
  style(hoverlabel = hover.layout)  %>%
  
  colorbar(tickvals = tick.positions,  #Specifies where the ticks appear along the color bar
           ticktext = levels(mariah.complete$breaks),  #Specifies what labels are shown
           tickfont = list(size = 13),
           tickcolor = "#b42d1e",
           outlinecolor = "transparent") %>%
  
  animation_slider(font = list(color = "#126c50"),   #Customising how the animation slider looks
                   currentvalue = date.layout,   
                   bgcolor = "#126c50",             
                   tickcolor = "#126c50") %>%
  
  animation_button(font = list(text = "<b>Play</b>",  #Customising how the Play button looks
                               color = "#126c50",
                               size = 14))

mariah.map

We can also export the map as an HTML widget. The partial_bundle() function is used to reduce the file size of the output.

#To export the plotly map
saveWidget(partial_bundle(mariah.map), "map.html")

While the map is useful to show the geographical unevenness of chart ranks, it can be slightly difficult to track temporal changes and the overall distribution of chart ranks across time. To solve this, I decided to create an accompanying histogram to track temporal changes in chart rank distribution more easily.

Plotting an accompanying histogram

mariah.hist <- plot_ly(mariah.complete,
        x = ~breaks,
        type = "histogram",
        frame = ~Date,
        showlegend = F,
        texttemplate = "<b>%{y}</b>",     #to add frequency count above bars
        textposition = "outside",
        marker = list(color = "#b42d1e")) %>% #to specify color of bars
  
  layout(paper_bgcolor = "#faf5e7",  #to specify color of paper
         plot_bgcolor = "#faf5e7",   #to specify color of plot background
         margin = list(t = 50),      #to add top margins to the plot  
         font = list(family = "Roboto Condensed",   #controls font family and color
                     color = "#b42d1e"),
         title = "<b>Distribution of Spotify Chart Ranks for 'All I Want for Christmas Is You' across 67 regions</b><br><i>Source: <a href = 'https://www.kaggle.com/datasets/dhruvildave/spotify-charts?select=charts.csv'>Spotify Top 200 Charts</a></i>",
         xaxis = list(title = "",            #removes x-axis title
                      range = c(-1, 8)),     #sets zoom window of the plot
         yaxis = list(range = c(0, 59))) %>%  
  
  animation_slider(font = list(color = "#126c50"),  #controls animation slider visuals
                   currentvalue = date.layout,   
                   bgcolor = "#126c50",             
                   tickcolor = "#126c50") %>%
  
  animation_button(font = list(text = "<b>Play</b>", #controls animation button visuals
                               size = 14,
                               color = "#126c50"))

mariah.hist

To export the histogram, I’ll just follow the same steps as above.

#Exporting histogram
saveWidget(partial_bundle(mariah.hist), "hist.html")

And we’re done!