Data Science in R: Canada Travelling Code Walkthrough

Posted by

Data Science in R: Canadian Travel Data Walkthrough.

For background, first read through the results in my previous blog post. All code is available on my GitHub.

Data Sources and Notes

The data for this analysis was sourced from Statistics Canada. I’ve compiled it into one file (including CPI, FX and Population) called “Travel.xlsx”, available in the GitHub folder.

Importing and Cleaning the Data

I prefer using the openxlsx package simply because it allows you to easily import Excel files without converting them to .csv first. The code is pretty straightfoward.

fromCanada <- read.xlsx("Travel.xlsx", sheet = "TravelData")

Let’s take a look at the data:

> fromCanada
                 Country Visits Nights Spent_CDN Year
1          United States  20702 216628     19929 2015
2                 Mexico   1926  25650      2306 2015
3         United Kingdom   1192  16688      1714 2015
4                 France   1140  15041      1549 2015
5                   Cuba    800   8216       708 2015
6                Germany    574   5523       542 2015
7     Dominican Republic    487   6871       561 2015
8                  China    480  10042      1009 2015
9                  Italy    469   5871       583 2015
10                 Spain    399   4993       382 2015
11  Ireland, Republic Of    254   3510       291 2015
12             Australia    245   5158       664 2015
13             Hong Kong    243   3246       319 2015
...
179              Ireland    144   1633       185 2004
180          Switzerland    142    980       105 2004
181        United States  12666  97333      8075 2003
182               Mexico    716   7375       790 2003
183       United Kingdom    684   8624       821 2003
184               France    509   6468       671 2003
185                 Cuba    495   4408       451 2003
186   Dominican Republic    415   3983       403 2003
187              Germany    331   3297       278 2003
188                Italy    248   3279       384 2003
189          Netherlands    165   1401       110 2003
190                Spain    154   2017       199 2003
191          Switzerland    125    911        98 2003
192                Japan    122   1643       193 2003
193       Mainland China    115   2245       197 2003
194              Austria    109    586        77 2003

Something you may have noticed is that we have differing names for some countries over the years. This happens often with piecemeal data like this, as surveying methods may slightly differ each year. We can use the dplyr package to clean these up:

fromCanada <- fromCanada %>%
  mutate(Country = ifelse(Country == "Republic of Ireland" | Country == "Republic Of Ireland" | Country ==  "Ireland, Republic Of", "Ireland",
                          ifelse(Country == "Mainland China", "China", Country)))

There are couple powerful tools in the dplyr package used here:

  • The %>%, or pipe tool.

While not actually a true function in itself, this pipe make the code look cleaner and improves readability. The pipe works as follows (side note – I can’t for the life of me get LaTeX to work on this WordPress.com blog. Any tips?):

x %>% f(y) = f(x,y)

As you can see, the pipe will insert whatever you put before it into whatever function comes after it. This is very useful for nested functions with many levels.

  • mutate(data, …)

The mutate function creates new columns in your data frame while also preserving the old ones. What we’ve done here is told mutate to overwrite the current value in the Country column based on the If/Else/Or logic that follows. If you had chosen another column name, say Country_new, this column would have been added to the end.

Lines 31-38 are a quick hack to assign the ranking of the countries to each year based on the number of visits. I’ll leave that one to read through yourself.

Bump Chart

Bump charts are a cool way to show changes over time, especially usefull with ranking data as we have. The ggplot2 package does a decent job of creating this, though require some adjustments to make them look nice.

ggplot(fromCanada, aes(Year,YearRank,color = Country)) +
  geom_line(size=1) +
  scale_y_continuous(breaks = c(1:15), trans = "reverse") +
  scale_x_continuous(breaks = c(2001:2015)) +
  theme(legend.position="none") +
  ggtitle("Top 15 Travel Destinations by Canadians") +
  geom_text(data = subset(fromCanada, Year == "2015"), aes(label = Country, x = Year +1.2),
            size =3.5, hjust = 0.0 ) +
  scale_color_manual(values=colors)  +
  theme(panel.background = element_blank())

BumpChartRecolor

Here are the parts that I consider the least intuitive:

  1. aes(..., color = Country)

Luckily for us, ggplot is a very smart function. By telling it to color the graph based on the the Country, it knows to group the data for each country as it’s own line. Try for yourself and see what the graph would look like without the grouping.

  1. geom_text(data = subset(fromCanada, Year == "2015"), aes(label = Country, x = Year +1.2), size =3.5, hjust = 0.60)

This one is a mouthful. Here is what it’s doing:

  • geom_text(): We want to insert text into our graph.
  • data = subset(fromCanada, Year == "2015"): We will be sourcing the text from the fromCanada dataframe, but only looking at the 2015 year.
  • aes(label = Country, x = Year +1.2): From the data, take the Country names and use that as our label. We want to place those labels on the x-axis at 2015 + 1.2 = 2016.2
  • size =3.5, hjust = 0.0: These adjust the size and horizontal placement of the labels.

Cumulative Visits Charts

Jumping down the code a bit, you’ll see that some clean-up has been done before creating the graphs. Take a look at the difference between the two following graphs:

Why does this happen? It has to do with the fact that some of the countries do not appear in all the years. Take a read through This StackExchange post for details. To fix this, we can use the xtabs function to insert zeros into the data for all missing points:

Visitsnew <-xtabs(Visits~Country+Year, fromCanada) %>%
  as.data.frame() %>%
  transmute(Country, Year, Visits = Freq)
  1. xtabs(Visits~Country+Year, fromCanada):
    • This creates a matrix of Country by Year, with Visits as the entry in the matrix. 0 is inserted if unknown.
  2. transmute(...)
    • Similar to mutate, however it throws out any columns you don’t name. Thus here we’re just renaming Freq to Visits.

Last Notes

A quick shout out again to the dplyr package – in line 226 you’ll notice the use of left_join(). While R has similar functions, I prefer the dplyr version as it is almost identical to the more familiar SQL functions. Type ?left_join into R to see all the available joins.

Want more detail on any of the code? Leave me a comment and I’ll get back to you!

Advertisements

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s