Data Science: Where Are Canadians Travelling?

Despite living in the second largest country in the world, sometimes you just need to get away from things, eh? Statistics Canada provides a yearly summary table of the top travel destinations for Canadians in the form of the Travel Survey of Residents of Canada (TSRC). The visualizations below give us some insight into our habits and how global events affect our travel plans. They have all been written in R, largely using the ggplot2 package.

All code and a cleaned up data file can be found on my GitHub.

A Short Primer on the Data

I have compiled all the summary tables into a single file, along with a CPI table and a Canadian population table. The data is presented in yearly intervals, so line charts contain a single point for each year. The data spans 2001-2015 and represents the top 15 countries visited by residents of Canada in each year.

The travel data has five fields:
– Year
– Country
– Number of Visits
– Nights Spent
– Dollars Spent (in Canadian dollars)

For those with questions about the survey methodology, the data imputation strategy and the assumptions (e.g. how visits to multiple countries are counted), please check out the Official CANSIM Documentation.

The Results

For those who are interested in seeing a walkthrough of the code, read through the results and then head to “Data Science In R: Canadian Travel Data”.

First, let’s look at a table of the countries based on the number of times they appear in the top 15:

Country Frequency
China 15
Cuba 15
Dominican Republic 15
France 15
Germany 15
Italy 15
Mexico 15
Netherlands 15
Spain 15
United Kingdom 15
United States 15
Hong Kong 14
Ireland 12
Switzerland 11
Australia 6
Jamaica 5
Japan 5
Austria 4
Greece 2
Bahamas 1

Does anything here surprise you? This table lacks much of the detail we need to draw conclusions. Instead, let’s take a look at the following Bump Chart. It ranks the countries by number of visits over time. Note that only the countries still in the top 15 in 2015 are labelled, as some countries fall off the chart as time progresses.

[Bump chart: top 15 travel destinations by Canadians, ranked by number of visits, 2001-2015]

There are some interesting takeaways from this chart. As you might expect, Canadians have always made the United States their number one travel destination. The United Kingdom competed with Mexico for second place for a number of years before falling, along with France, to 5th and 6th place respectively in 2009. The Netherlands also sat comfortably in 10th place until nearly dropping off the chart in the past few years. What could be the causes of this?

Let’s try to quantify some of these trips. By using a stacked area plot of the number of trips (“Visits”) taken, we can get a sense of the relative scale:

[Stacked area chart: number of visits by destination country over time]

Wow! The United States receives by far the most Canadian travellers. Maybe not too surprising though, considering 75% of Canadians live within 100 miles of the American border.

How Long Are We Staying?

For this, I’ve taken the average number of nights stayed per visit in 2015. For interest’s sake I’ve also overlaid the cost of a flight to each country, chosen as the lowest result Google provides for a Wednesday flight, six weeks from the writing of this article, to that country’s capital city. Note that for some countries, especially the United States, the cost of a flight can vary wildly, so these numbers are far from rigorous.

[Bar chart: average nights stayed per visit in 2015, with sample flight costs overlaid]

Diving Into the US Data

Since the US represents such a large portion of the data, I thought it’d be interesting to take a deeper dive into the Canadian-American relationship. Based on the data available, we can answer a few questions:

How Many Visits Have Canadians Made to the US Over Time?

For this question, I’ve opted to show two ways of looking at the data. The unadjusted line shows the raw number of visits given by the data. I have also opted to make a population adjustment, scaling the number of visits to the 2015 population level. This was done using Statistics Canada population data and the following formula:

 Adjusted Visits 20XX = (Population of Canada 2015) / (Population of Canada 20XX) * (Number of Visits 20XX)
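
In R, this adjustment boils down to a join and a ratio. Here is a minimal sketch, using the fromCanada data frame from the code walkthrough and assuming a population data frame with Year and Population columns (the column names are illustrative, not taken from the actual data file):

library(dplyr)

# Population of Canada in 2015, used as the reference level
pop_2015 <- population$Population[population$Year == 2015]

us_visits <- fromCanada %>%
  filter(Country == "United States") %>%
  left_join(population, by = "Year") %>%                      # attach each year's population
  mutate(Adjusted_Visits = Visits * pop_2015 / Population)    # scale visits to the 2015 population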

I have also added a couple of notes which we can discuss after viewing the results:

[Line chart: visits by Canadians to the US over time, unadjusted and population-adjusted]

You’ll notice a general upward trend, even after adjusting for population; however, there are some noticeable drops. I’ve made two suggestions to explain these drops:

  1. The 2008 Recession reduced the desire or ability to travel to the United States due to a lack of funds.
  2. The sudden drop of the $CDN from par ($1 CDN = $1 USD) around 2013 made it less appealing to travel to the US due to the cost.

Do you disagree or have additional thoughts? Leave me a comment and let me know.

How Much Are Canadians Spending While in the US?

This is a bit trickier to answer. We have some competing forces at play here, and questions about how we calculate spending. Do we look at spending per night or per visit? How do we account for the exchange rate and inflation? Let’s first take a look at the raw data on average spending. Don’t forget that all the numbers are presented in Canadian dollars.

[Line chart: average amount spent per night in the US, unadjusted, in Canadian dollars]

At face value, it looks like we have been spending more per night as time progresses. However, we can tease more information out of this data. Here is my methodology for adjusting the amount paid (a rough code sketch follows the steps):

  1. Say, for example, that in 2002 we spent 85 Canadian dollars.
  2. The exchange rate in 2002 was about $1 USD = $1.5 CDN, so we spent 85/1.5 = $57 USD.
  3. Accounting for inflation at 2% per year, this is now worth $57 * (1.02)^13 = $74 USD in 2015.
  4. The exchange rate in 2015 was about $1 USD = $1.25 CDN, so we spent $74 * 1.25 = $92.50 CDN.
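
Expressed as a few lines of R, the steps above look like this (a rough sketch using the illustrative numbers from the list; the real analysis uses the full CPI and FX tables):

spent_cdn_2002 <- 85       # example average spend in 2002, CDN
fx_2002        <- 1.50     # approximate rate: $1 USD = $1.50 CDN
fx_2015        <- 1.25     # approximate rate: $1 USD = $1.25 CDN
inflation      <- 1.02     # assumed 2% annual inflation

spent_usd_2002 <- spent_cdn_2002 / fx_2002                   # ~57 USD
spent_usd_2015 <- spent_usd_2002 * inflation^(2015 - 2002)   # ~74 USD in 2015 dollars
spent_cdn_2015 <- spent_usd_2015 * fx_2015                   # ~92.50 CDN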

To be a bit more technical, I have pulled the Canadian CPI to use for the inflation calculation. Additionally, the CAD-USD FX rates are taken as at January 1 of each year (see the sources at the end of this post). Here is the resulting graph with these adjustments:

[Line chart: average amount spent per night in the US, adjusted for exchange rate and inflation]

This shows a much different picture. It looks like early in the decade Canadians were spending more and more on their travels to the US, while in more recent years they’ve been spending less. Take these results with a grain of salt – there are still many possible confounding variables and questions we could ask ourselves about the data, such as:

  1. Can we make an adjustment for the number of nights stayed?
    • One of the largest travel expenses for most people is likely the flight costs. The more nights you stay, the more nights this cost will be spread over.
  2. Are Canadians choosing different travel destinations within the US?
    • There are nearly countless vacation destinations within the US. These all come with their own costs and stay durations.

Come up with your own questions and try answering some of them yourself! You can find the code on my GitHub and a walkthrough of the code in my follow-up article, “Data Science In R: Canadian Travel Data”.

Sources:
Travel data: http://www.statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/arts37j-eng.htm
CPI: http://www.statcan.gc.ca/tables-tableaux/sum-som/l01/cst01/econ46a-eng.htm
Population: http://www5.statcan.gc.ca/cansim/a26
FX rates: http://www.bankofcanada.ca/rates/exchange/legacy-noon-and-closing-rates/

Data Science in R: Canadian Travel Data Walkthrough

For background, first read through the results in my previous blog post. All code is available on my GitHub.

Data Sources and Notes

The data for this analysis was sourced from Statistics Canada. I’ve compiled it into one file (including CPI, FX and Population) called “Travel.xlsx”, available in the GitHub folder.

Importing and Cleaning the Data

I prefer using the openxlsx package simply because it allows you to easily import Excel files without converting them to .csv first. The code is pretty straightforward.

library(openxlsx)

fromCanada <- read.xlsx("Travel.xlsx", sheet = "TravelData")

Let’s take a look at the data:

> fromCanada
                 Country Visits Nights Spent_CDN Year
1          United States  20702 216628     19929 2015
2                 Mexico   1926  25650      2306 2015
3         United Kingdom   1192  16688      1714 2015
4                 France   1140  15041      1549 2015
5                   Cuba    800   8216       708 2015
6                Germany    574   5523       542 2015
7     Dominican Republic    487   6871       561 2015
8                  China    480  10042      1009 2015
9                  Italy    469   5871       583 2015
10                 Spain    399   4993       382 2015
11  Ireland, Republic Of    254   3510       291 2015
12             Australia    245   5158       664 2015
13             Hong Kong    243   3246       319 2015
...
179              Ireland    144   1633       185 2004
180          Switzerland    142    980       105 2004
181        United States  12666  97333      8075 2003
182               Mexico    716   7375       790 2003
183       United Kingdom    684   8624       821 2003
184               France    509   6468       671 2003
185                 Cuba    495   4408       451 2003
186   Dominican Republic    415   3983       403 2003
187              Germany    331   3297       278 2003
188                Italy    248   3279       384 2003
189          Netherlands    165   1401       110 2003
190                Spain    154   2017       199 2003
191          Switzerland    125    911        98 2003
192                Japan    122   1643       193 2003
193       Mainland China    115   2245       197 2003
194              Austria    109    586        77 2003

Something you may have noticed is that we have differing names for some countries over the years. This happens often with piecemeal data like this, as surveying methods may slightly differ each year. We can use the dplyr package to clean these up:

library(dplyr)

fromCanada <- fromCanada %>%
  mutate(Country = ifelse(Country == "Republic of Ireland" | Country == "Republic Of Ireland" | Country == "Ireland, Republic Of", "Ireland",
                          ifelse(Country == "Mainland China", "China", Country)))

There are a couple of powerful tools from the dplyr package used here:

  • The %>%, or pipe tool.

While not actually a true function in itself, the pipe makes the code look cleaner and improves readability. The pipe works as follows (side note – I can’t for the life of me get LaTeX to work on this WordPress.com blog. Any tips?):

x %>% f(y) = f(x,y)

As you can see, the pipe inserts whatever you put before it as the first argument of the function that comes after it. This is very useful for nested functions with many levels.
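
As a quick illustration (not taken from the original script), these two calls are equivalent; the benefit of the pipe becomes obvious once several functions are chained together:

filter(fromCanada, Year == 2015)        # standard call
fromCanada %>% filter(Year == 2015)     # the same call written with the pipe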

  • mutate(data, …)

The mutate function creates new columns in your data frame while also preserving the old ones. What we’ve done here is tell mutate to overwrite the current values in the Country column based on the if/else logic that follows. If you had chosen another column name, say Country_new, that column would have been added to the end.
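
For example, this variant (again just a sketch) keeps the original Country column untouched and appends the recoded version as a new column:

fromCanada %>%
  mutate(Country_new = ifelse(Country == "Mainland China", "China", Country))   # adds Country_new at the end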

Lines 31-38 of the script are a quick hack to assign each country’s ranking within each year based on the number of visits (a sketch of the idea is shown below). I’ll leave that one for you to read through yourself.
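
If you’d like a head start, here is one way to build that per-year ranking with dplyr (not necessarily the exact code on GitHub):

fromCanada <- fromCanada %>%
  group_by(Year) %>%                                            # rank within each year
  mutate(YearRank = rank(-Visits, ties.method = "first")) %>%   # most visits = rank 1
  ungroup()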

Bump Chart

Bump charts are a cool way to show changes over time, and they are especially useful with ranking data like ours. The ggplot2 package does a decent job of creating them, though they require some adjustments to look nice.

library(ggplot2)

# 'colors' is a vector of colour values defined elsewhere in the full script on GitHub
ggplot(fromCanada, aes(Year, YearRank, color = Country)) +
  geom_line(size = 1) +
  scale_y_continuous(breaks = c(1:15), trans = "reverse") +
  scale_x_continuous(breaks = c(2001:2015)) +
  theme(legend.position = "none") +
  ggtitle("Top 15 Travel Destinations by Canadians") +
  geom_text(data = subset(fromCanada, Year == "2015"), aes(label = Country, x = Year + 1.2),
            size = 3.5, hjust = 0.0) +
  scale_color_manual(values = colors) +
  theme(panel.background = element_blank())

[Bump chart: top 15 travel destinations by Canadians, ranked by number of visits, 2001-2015]

Here are the parts that I consider the least intuitive:

  1. aes(..., color = Country)

Luckily for us, ggplot is a very smart function. By telling it to colour the graph based on the Country, it knows to group the data for each country as its own line. Try it for yourself and see what the graph would look like without the grouping.
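
For the curious, here is roughly what that experiment looks like (a sketch; styling omitted). Without the colour mapping there is nothing to group by, so all the countries get joined into one jagged line:

ggplot(fromCanada, aes(Year, YearRank)) +
  geom_line(size = 1)   # no colour/group aesthetic, so every point is connected into a single line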

  2. geom_text(data = subset(fromCanada, Year == "2015"), aes(label = Country, x = Year + 1.2), size = 3.5, hjust = 0.0)

This one is a mouthful. Here is what it’s doing:

  • geom_text(): We want to insert text into our graph.
  • data = subset(fromCanada, Year == "2015"): We will be sourcing the text from the fromCanada dataframe, but only looking at the 2015 year.
  • aes(label = Country, x = Year +1.2): From the data, take the Country names and use that as our label. We want to place those labels on the x-axis at 2015 + 1.2 = 2016.2
  • size =3.5, hjust = 0.0: These adjust the size and horizontal placement of the labels.

Cumulative Visits Charts

Jumping down the code a bit, you’ll see that some clean-up has been done before creating the graphs. Take a look at the difference between the two following graphs:

Why does this happen? It has to do with the fact that some of the countries do not appear in all of the years. Take a read through this StackExchange post for details. To fix it, we can use the xtabs function to insert zeros into the data for all missing points:

Visitsnew <- xtabs(Visits ~ Country + Year, fromCanada) %>%
  as.data.frame() %>%
  transmute(Country, Year, Visits = Freq)

Breaking this down:

  1. xtabs(Visits ~ Country + Year, fromCanada):
    • This creates a matrix of Country by Year, with Visits as the entry in each cell. A zero is inserted wherever a combination is missing.
  2. transmute(...):
    • Similar to mutate, however it throws out any columns you don’t name. Thus here we’re just renaming Freq back to Visits.
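
With the zero-filled data in place, the stacked area chart from the results post can be built along these lines (a minimal sketch; the colours and labels of the published chart come from the full script):

Visitsnew$Year <- as.numeric(as.character(Visitsnew$Year))   # xtabs returns Year as a factor

ggplot(Visitsnew, aes(Year, Visits, fill = Country)) +
  geom_area() +
  ggtitle("Visits by Canadians to Top Travel Destinations")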

Last Notes

A quick shout-out again to the dplyr package – in line 226 of the script you’ll notice the use of left_join(). While base R has similar functions, I prefer the dplyr version as it is almost identical to the more familiar SQL joins. Type ?left_join into R to see all the available joins.
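
As an illustration of the idea (the actual join in the script may use different tables and column names), attaching the CPI table to the travel data looks like this:

# 'cpi' is assumed to have Year and CPI columns (illustrative names)
fromCanada <- fromCanada %>%
  left_join(cpi, by = "Year")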

Want more detail on any of the code? Leave me a comment and I’ll get back to you!

Aspiring Actuary: 5 Tips for Your First Actuarial Exam

As all good (aspiring) actuaries do, I will start this article with a few assumptions. First, I assume you have some familiarity with the actuarial exam process (check out my previous article if needed). I’m also going to assume that you are writing one of the Computer-Based Testing (CBT) exams – likely Exam P or Exam FM. Here are a few tips I’ve compiled that have helped me succeed with the “prelim” SOA/CAS exams.

1. Start Early

If I wanted to make this article even more click-baity than it already is, I would list this as Step 0. Underestimating the time commitment it takes to write these exams is the number one trap I see people falling into. I would equate the time spent on a prelim exam to roughly an upper-year university math course – something that you can’t (or shouldn’t) study for in just a few days. Remember, the pass mark on these exams is ~70%, which means there is less room for error than what you may be used to.

The exam syllabus has a wide breadth. I would use the 100-metre puddle analogy: you are covering a little bit of each of many topics. This lends itself well to setting a schedule. When you sign up for your exam date, take a look at the number of days you have left and set up an Excel spreadsheet with all the topics and your goal completion dates. Don’t forget to leave yourself time for practice problems!

2. Get a Good Calculator, and Get Good With It.

This is a very close second on my list of most important tips. The exams are crunched for time, and it is essential that you get familiar and accurate with your calculator. My Exam 5 (CAS – Basic Ratemaking) proctor made a comment after our exam about this: “It’s amazing to see you kids these days, bringing multiple calculators to the exam and using both your thumbs to type!”

Remember that the SOA has an approved list of calculators. For all exams I would recommend the TI-30XS MultiView (you can support this blog by buying it via the Amazon link). It has the advantage of letting you easily scroll through past calculations, and it offers many summary statistic functions that can provide shortcuts for certain problems. Above all, the buttons on this calculator are responsive and accurate, meaning fewer mistakes. If you have any questions on how to use it, please leave me a comment.

3. Choose the Right Study Material (For You)

You are “lucky” in the sense that the prelims have a large amount of study material to choose from. Many companies publish study guides, practice exams and online exams to help you learn the material and get a feel for the types of questions you’ll be asked. Of course, you can also study from the source material (the textbooks found on the syllabus), though most exam passers you speak to will argue for their preferred alternative study source.

Of course, the one caveat with these materials is that they can get quite expensive, and since you are writing your first actuarial exam you may not yet have the backing of a company to pay for them. In this case I would recommend trying the free samples that each of the companies puts out – you’ll have the advantage of finding the study material that works best for you while getting some extra study time in.

4. Utilize Old Exams and Example Questions

Old exams and sample questions are the single most important study tool at your disposal. They’re also free! For the CBT exams, the SOA has released sets of practice questions as well as online practice exams for Exam P and Exam FM. For whatever reason the SOA doesn’t seem to have the PDFs posted prominently on their website, but they’re easy enough to find by Googling “SOA Exam X Practice Problems”.

There will always be debates as to whether these questions are harder or easier than the actual exams, but they are definitely an excellent way to familiarize yourself with the SOA’s exam problem wording. I recommend you study these closely and ensure that you fully understand the answers.

5. Ask Yourself If You’re Being Productive

You may have heard of the “100 hour rule” that gets thrown around in actuarial circles. It states that for every hour of exam sitting time, you should put in 100 hours of study time. For the first few exams this equates to roughly 300 hours of quality studying. Remember, the pass rate for Exam P hovers around 50% (see Actuarial-Lookup.com). While there are always exceptions, I’m comfortable claiming that if you follow this rule and put in 300 hours of dedicated study, you should expect to be in the passing half of the group. This requires discipline – ask yourself while you’re studying whether you’re actually studying. Did you spend 20 minutes of the last hour browsing Facebook? That time doesn’t count.

At the same time, ensure that you take breaks. Or, if you want to keep going but find the material isn’t sticking, switching to a new topic or question set can reboot your motivation.

Actuarial Science in Python – Loss Triangles

Loss triangles are an essential tool for the Property & Casualty actuary. They are a visual representation of the development of claims over time, and they are the starting point for many predictive techniques such as the chain ladder method.

The code and example data is available at my GitHub.

For this example, we will look at a Policy Year triangle with 12-month developments. Since the purpose of this article is to demonstrate how to create a loss triangle in Python, I will not go into too much technical detail, but look out for future posts where I’ll go into more depth.

The following triangle (created using Excel and simulated data) is an example of our end goal. Data is presented as of December-31st-2017.

[Example cumulative loss triangle: Policy Year rows, 12-month development columns, evaluated as of December 31st, 2017]

How do we interpret this triangle? Let’s take a look at the first row, denoted by Policy Year 2014. Any policy which starts (the “effective date”) anytime in 2014 will fall into this category. The columns denote the development in months. For policy year 20XX and development Y, the corresponding cell represents the cumulative claims reported for policies written in 20XX as of Y months from January-1-20XX.

For example, take a look at PY 2014 at development 24. This tells us: “As of Dec 31st, 2015, there have been $1,000 worth of claims reported on policies written during 2014”. Take a moment to reason through why the data presents itself as a triangle. You’ll note that as each 12-month period passes, a new entry is added for each policy year along the diagonal, hence increasing the size of the triangle.

Loss Triangles In Python

More often than not, loss triangles are created in Excel. While this works well for creating the triangles, learning how to replicate this process in Python (or R!) will allow you to easily apply analysis techniques via additional Python scripts.

First, let’s take a look at the raw transactional data (again, simulated for example purposes):

This looks like typical claims data. Each row represents a transaction, and we’re given the Policy ID, Claim ID, dates and transaction amounts. In this example, we will be focusing on creating a Policy Year claims triangle as in the example above.

If you do not have Python set up yet, I recommend doing a Google search for one of the many guides or videos that can help. We will be using the NumPy and Pandas packages:

import numpy as np
import pandas as pd

We will now import the data. It is good practice to check to see if there is any missing data as well:

path = "C:/MY_PATH/ClaimsExample.csv"
claims = pd.read_csv(path, sep = ",")  # read_csv already returns a pandas DataFrame

# Check for missing data
claims.isnull().any()
Out[8]: 
Policy_ID                False
Claim_ID                 False
Policy_Effective_Date    False
Accident_Date            False
Report_Date              False
Transaction_Date         False
Indemnity                False
Expenses                 False
Total                    False
dtype: bool

Now, the first thing we’ll need to calculate is the development or lag for each line. In this case, the lag is calculated as the Transaction Year – Policy Year + 1. Note the semantics here – we are calculating the amount reported for the Policy as at Dec/31 of the Transaction year, thus the “+1” is needed.

We must extract the Year from the given dates. The methodology used will depend on your data formatting.

# The dates are stored as strings like "1/1/2010", so take the last four characters as the year
claims['Policy_Year'] = [int(x[-4:]) for x in claims['Policy_Effective_Date']]
claims['Transaction_Year'] = [int(x[-4:]) for x in claims['Transaction_Date']]

Next, we can calculate the lags.

claims['PY_Lag'] = claims['Transaction_Year'] - claims['Policy_Year'] + 1

Let’s take a look at a few rows of our data now for spot checking.

claims[:5]
Out[23]: 
   Policy_ID  Claim_ID Policy_Effective_Date Accident_Date Report_Date  \
0        101         1              1/1/2010      7/1/2010    8/1/2010   
1        101         1              1/1/2010      7/1/2010    8/1/2010   
2        101         1              1/1/2010      7/1/2010    8/1/2010   
3        101         1              1/1/2010      7/1/2010    8/1/2010   
4        101         1              1/1/2010      7/1/2010    8/1/2010   

  Transaction_Date  Indemnity  Expenses  Total  Policy_Year  Transaction_Year  \
0         9/1/2010      100.0      50.0  150.0         2010              2010   
1        11/1/2010      200.0      20.0  220.0         2010              2010   
2         1/1/2011      300.0       0.0  300.0         2010              2011   
3         3/1/2011      200.0       0.0  200.0         2010              2011   
4         5/1/2011       50.0       0.0   50.0         2010              2011   

   PY_Lag  
0       1  
1       1  
2       2  
3       2  
4       2 

Looks like our code worked! As expected, lines with the same policy year and transaction year show a lag of one (i.e. 12 months of development).

Next we get to use a bit of Pandas DataFrame manipulation. We are only interested in the development in aggregate, so we need to sum over the policies at each lag for each policy year. We’re going to combine losses and expenses in this example, so we use the Total column.

Here is the code and result, which I will walk through:

py_data = claims['Total'].groupby([claims['Policy_Year'],claims['PY_Lag']]).sum().reset_index()
py_data
Out[25]: 
   Policy_Year  PY_Lag    Total
0         2010       1    370.0
1         2010       2   6910.0
2         2010       3   -200.0
3         2010       4   -600.0
4         2011       1    960.0
5         2011       2   4320.0
6         2011       3    100.0
7         2012       1   1090.0
8         2012       2   8550.0
9         2013       1  21720.0

This may seem complicated, but we can break it down into steps:

  • claims['Total'] : We are choosing the Total column
  • groupby([claims['Policy_Year'],claims['PY_Lag']]) : We want to see the total column in the aggregate for each combination of Policy_Year and PY_Lag
  • sum(): We are choosing sum as our aggregate function
  • reset_index() : This one is a bit trickier: it forces the DataFrame to reset the “index” for each line so that Policy_Year and PY_Lag are moved into their own columns. This may be easier to understand by looking at the result without the reset_index function. Try it out yourself!

We are now ready to make our triangle. The Pandas function pivot_table works very similarly to pivot tables in Excel, which you might be more familiar with. We first need to calculate the cumulative sum at each lag for each policy year. This is done using a similar technique to the one above.

py_data['cumsum'] = py_data["Total"].groupby(py_data["Policy_Year"]).cumsum()
py_cumu_triangle = pd.pivot_table(py_data, index = ["Policy_Year"], columns = ["PY_Lag"], values = ["cumsum"])
print(py_cumu_triangle)
              cumsum                        
PY_Lag             1       2       3       4
Policy_Year                                 
2010           370.0  7280.0  7080.0  6480.0
2011           960.0  5280.0  5380.0     NaN
2012          1090.0  9640.0     NaN     NaN
2013         21720.0     NaN     NaN     NaN

Success! We have created a loss triangle using Python. If you are familiar with loss triangles you will notice that this one does not look like a triangle you would expect for stable business. That, however, is something to be discussed in a future post.

Post Script:

We are also sometimes interested in an incremental loss triangle. We can easily edit the code above to accomplish this. Try out the code below yourself:

py_triangle = pd.pivot_table(py_data, index = ["Policy_Year"], columns = ["PY_Lag"], values = ["Total"])
print(py_triangle)
               Total                      
PY_Lag             1       2      3      4
Policy_Year                               
2010           370.0  6910.0 -200.0 -600.0
2011           960.0  4320.0  100.0    NaN
2012          1090.0  8550.0    NaN    NaN
2013         21720.0     NaN    NaN    NaN