Data Science: Where Are Canadians Travelling?

Where Are Canadians Travelling To?

Despite living in the second largest country in the world, sometimes you just need to get away from things, eh? Statistics Canada provides a yearly summary table depicting the top travel destinations for Canadians in the form of the Travel Survey of Residents of Canada (TSRC). The visualizations give us some insight into our habits and how global events affect our travel plans. Visualization have all been written in R, largely using the gpplot2 package.

All code and a cleaned up data file can be found on my GitHub.

A Short Primer on the Data

I have compiled all the summary tables into a single file, as well as a CPI table and a Canadian Population table. The data is presented in yearly intervals, thus line charts will contain a single point for each year. The data collected spans from 2001-2015, and represents the top 15 countries visited by residents of Canada.

The travel data has five data entries:
– Year
– Country
– Number of Visits
– Nights Spent
– Dollar Spent in Canadian Dollars

For those with questions about the survey methodology, data imputation strategy and the assumptions (i.e. visiting multiple countries), please check out the Official CANSIM Documentation.

The Results

For those who are interested in seeing a walk through of the code, read through the results an head to “Data Science In R: Canadian Travel Data“.

First, let’s look at a table of the countries based on the number of times they appear in the top 15:

Country Frequency
China 15
Cuba 15
Dominican Republic 15
France 15
Germany 15
Italy 15
Mexico 15
Netherlands 15
Spain 15
United Kingdom 15
United States 15
Hong Kong 14
Ireland 12
Switzerland 11
Australia 6
Jamaica 5
Japan 5
Austria 4
Greece 2
Bahamas 1

Does anything here surprise you? This chart is lacking in much of the detail we need to draw conclusions. Instead, let’s take a look at the following Bump Chart. It ranks the countries as in order of number of visits over time. Note that the labels account only for the countries left in 2015, as some of the countries fall off the chart as time progresses.

BumpChartRecolor

There are some interesting take away from this chart. As probably expected, Canadians have always made the United States their number one travel destination. The United Kingdom was competing with Mexico for second place for a number of years before falling, along with France, to 5th and 6th place respectively in 2009. The Netherlands also sat comfortably at 10th place until nearly dropping off the chart in the past few years. What could be the causes of this?

Let’s try to quantify some of these trips. By using a stacked area plot on the number of trips (“Visits”) taken, we can get a sense of the relativity:

VisitsCount

Wow! The United States is overwhelmingly the largest importer of Canadian Travelers. Maybe not too surprising though, considering 75% of Canadians live within 100 miles of the American Border.

How Long Are We Staying?

For this, I’ve taken the average number of nights stayed in 2015 per visit. For interests sake I’ve also overlayed the cost of a flight to that country, chosen as the lowest result Google provides for a Wednesday flight six weeks from the writing of this article to the countries capital city. Note that for some countries, especially the United States, the cost of a flight can vary wildly, so these are far from rigorous.

AverageNights

Diving Into the US Data

Since the US represents such a large portion of the data, I thought it’d be interesting to take a deeper delve into the Canadian – American relationship. Based on the data available, we can answer a few questions:

How Many Visits Have Canadians Made to the US Over Time?

For this question, I’ve opted to show two ways of looking at the data. The unadjusted line shows the raw number of visits given by the data. I have also opted to make a population adjustment, bringing the number of visits to the 2015 level. This was done by using Statistics Canada Population Data, and using the following formula:

 Adjusted Visits 20XX = (Population of Canada 2015) / (Population of Canada 20XX) * (Number of Visits 20XX)

I have also added a couple notes which we can discuss after viewing the results:

Rplot01

You’ll notice a general upward trend, even adjusted for population, however there are some noticeable drops. I’ve made two suggestions to explain these drops:

  1. The 2008 Recession reduced the desire or ability to travel to the United States due to lack of funds
  2. The sudden drop of the $CDN from par ($1 CDN = $1 USD) around 2013 made it less appealing to travel to the US due to the cost.

Do you disagree or have additional thoughts? Leave me a comment and let me know.

How Much Are Canadians Spending While in the US?

This is a bit more tricky to answer. We have some competing forces at play here, and questions about how we calculate spending. Do we look at per night or per visit? How to we account for the exchange rate and inflation? Let’s first take a look at the raw data of average spending. Don’t forget that all the numbers are presented in Canadian Dollars.

AverageSpentUS

At face value, it looks like we have been spending more per night as time progresses. However, we can tease more information out of this data. Here is my methodology for adjusting the amount paid:

  1. Say, for example, we assume in 2002 we spent 85 Canadian Dollars
  2. The exchange rate in 2002 was about $1 USD = $1.5 CDN. Thus we spent 85/1.5 = $57 USD.
  3. Accounting for inflation at 2% per year, this is now worth $57 * (1.02)^13 = $74 USD in 2015.
  4. The exhange rate in 2015 was about $1 USD = $1.25 CDN. Thus we spent $74 * 1.25 = $92.50 CDN

To be a bit more technical, I have pulled the Canadian CPI to use as an inflation calculation. Additionally, the CAN-USD FX Rates are provided at January-1 of each year by Statistics Canada. Here is the resulting graph with these adjustments:

AdjustedSpent

This shows a much different picture. It looks like early in the decade, Canadians were spending more and more on their travels to the US, and in the more recent years they’ve been spending less. Take these results with a grain of salt – There are still many possible confounding variables and questions we could ask our self about the data, such as:

  1. Can we make an adjustment for the number of nights stayed?
    • One of the largest travel expenses for most people is likely the flight costs. The more nights you stay, the more nights this cost will be spread over.
  2. Are Canadians choosing different travel destinations within the US?
    • There are near countless vacation destinations within the US. These all come with their own costs and stay duration.

Come up with your own questions and have a try for yourself at answering some of them! You can find the code on my GitHub and a walkthrough of the code in my follow up article, “Data Science In R: Canadian Travel Data”

Sources:
Travel Data http://www.statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/arts37j-eng.htm
CPI: http://www.statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/econ46a-eng.htm
Population: http://www5.statcan.gc.ca/cansim/a26
FX Rates: http://www.bankofcanada.ca/rates/exchange/legacy-noon-and-closing-rates/

Advertisements

Data Science in R: Canada Travelling Code Walkthrough

Data Science in R: Canadian Travel Data Walkthrough.

For background, first read through the results in my previous blog post. All code is available on my GitHub.

Data Sources and Notes

The data for this analysis was sourced from Statistics Canada. I’ve compiled it into one file (including CPI, FX and Population) called “Travel.xlsx”, available in the GitHub folder.

Importing and Cleaning the Data

I prefer using the openxlsx package simply because it allows you to easily import Excel files without converting them to .csv first. The code is pretty straightfoward.

fromCanada <- read.xlsx("Travel.xlsx", sheet = "TravelData")

Let’s take a look at the data:

> fromCanada
                 Country Visits Nights Spent_CDN Year
1          United States  20702 216628     19929 2015
2                 Mexico   1926  25650      2306 2015
3         United Kingdom   1192  16688      1714 2015
4                 France   1140  15041      1549 2015
5                   Cuba    800   8216       708 2015
6                Germany    574   5523       542 2015
7     Dominican Republic    487   6871       561 2015
8                  China    480  10042      1009 2015
9                  Italy    469   5871       583 2015
10                 Spain    399   4993       382 2015
11  Ireland, Republic Of    254   3510       291 2015
12             Australia    245   5158       664 2015
13             Hong Kong    243   3246       319 2015
...
179              Ireland    144   1633       185 2004
180          Switzerland    142    980       105 2004
181        United States  12666  97333      8075 2003
182               Mexico    716   7375       790 2003
183       United Kingdom    684   8624       821 2003
184               France    509   6468       671 2003
185                 Cuba    495   4408       451 2003
186   Dominican Republic    415   3983       403 2003
187              Germany    331   3297       278 2003
188                Italy    248   3279       384 2003
189          Netherlands    165   1401       110 2003
190                Spain    154   2017       199 2003
191          Switzerland    125    911        98 2003
192                Japan    122   1643       193 2003
193       Mainland China    115   2245       197 2003
194              Austria    109    586        77 2003

Something you may have noticed is that we have differing names for some countries over the years. This happens often with piecemeal data like this, as surveying methods may slightly differ each year. We can use the dplyr package to clean these up:

fromCanada <- fromCanada %>%
  mutate(Country = ifelse(Country == "Republic of Ireland" | Country == "Republic Of Ireland" | Country ==  "Ireland, Republic Of", "Ireland",
                          ifelse(Country == "Mainland China", "China", Country)))

There are couple powerful tools in the dplyr package used here:

  • The %>%, or pipe tool.

While not actually a true function in itself, this pipe make the code look cleaner and improves readability. The pipe works as follows (side note – I can’t for the life of me get LaTeX to work on this WordPress.com blog. Any tips?):

x %>% f(y) = f(x,y)

As you can see, the pipe will insert whatever you put before it into whatever function comes after it. This is very useful for nested functions with many levels.

  • mutate(data, …)

The mutate function creates new columns in your data frame while also preserving the old ones. What we’ve done here is told mutate to overwrite the current value in the Country column based on the If/Else/Or logic that follows. If you had chosen another column name, say Country_new, this column would have been added to the end.

Lines 31-38 are a quick hack to assign the ranking of the countries to each year based on the number of visits. I’ll leave that one to read through yourself.

Bump Chart

Bump charts are a cool way to show changes over time, especially usefull with ranking data as we have. The ggplot2 package does a decent job of creating this, though require some adjustments to make them look nice.

ggplot(fromCanada, aes(Year,YearRank,color = Country)) +
  geom_line(size=1) +
  scale_y_continuous(breaks = c(1:15), trans = "reverse") +
  scale_x_continuous(breaks = c(2001:2015)) +
  theme(legend.position="none") +
  ggtitle("Top 15 Travel Destinations by Canadians") +
  geom_text(data = subset(fromCanada, Year == "2015"), aes(label = Country, x = Year +1.2),
            size =3.5, hjust = 0.0 ) +
  scale_color_manual(values=colors)  +
  theme(panel.background = element_blank())

BumpChartRecolor

Here are the parts that I consider the least intuitive:

  1. aes(..., color = Country)

Luckily for us, ggplot is a very smart function. By telling it to color the graph based on the the Country, it knows to group the data for each country as it’s own line. Try for yourself and see what the graph would look like without the grouping.

  1. geom_text(data = subset(fromCanada, Year == "2015"), aes(label = Country, x = Year +1.2), size =3.5, hjust = 0.60)

This one is a mouthful. Here is what it’s doing:

  • geom_text(): We want to insert text into our graph.
  • data = subset(fromCanada, Year == "2015"): We will be sourcing the text from the fromCanada dataframe, but only looking at the 2015 year.
  • aes(label = Country, x = Year +1.2): From the data, take the Country names and use that as our label. We want to place those labels on the x-axis at 2015 + 1.2 = 2016.2
  • size =3.5, hjust = 0.0: These adjust the size and horizontal placement of the labels.

Cumulative Visits Charts

Jumping down the code a bit, you’ll see that some clean-up has been done before creating the graphs. Take a look at the difference between the two following graphs:

Why does this happen? It has to do with the fact that some of the countries do not appear in all the years. Take a read through This StackExchange post for details. To fix this, we can use the xtabs function to insert zeros into the data for all missing points:

Visitsnew <-xtabs(Visits~Country+Year, fromCanada) %>%
  as.data.frame() %>%
  transmute(Country, Year, Visits = Freq)
  1. xtabs(Visits~Country+Year, fromCanada):
    • This creates a matrix of Country by Year, with Visits as the entry in the matrix. 0 is inserted if unknown.
  2. transmute(...)
    • Similar to mutate, however it throws out any columns you don’t name. Thus here we’re just renaming Freq to Visits.

Last Notes

A quick shout out again to the dplyr package – in line 226 you’ll notice the use of left_join(). While R has similar functions, I prefer the dplyr version as it is almost identical to the more familiar SQL functions. Type ?left_join into R to see all the available joins.

Want more detail on any of the code? Leave me a comment and I’ll get back to you!