For this project, you are going to be working with a dataset from a major cruise company to predict how much
money passengers will spend while onboard their cruise. This information is important for revenue forecasting.
Identifying high-spending customer segments is also useful for targeted marketing campaigns.
The dataset contains information on 5,000 randomly sampled passengers from 2016. The response variable
is Onboard.Amt, which represents the total amount of money that passenger spent while onboard. The
remaining 20 variables are candidate predictor variables.
Your task is to model Onboard.Amt USING A MULTIPLE LINEAR REGRESSION MODEL, and
investigate the question:
• “What type of passenger is spending the most money?”
• Half-way check – 10%
– Half page written and evidence of having done something in R
– Wednesday 3/22 by Midnight
• Final Write-up – 40%
– Wednesday 3/29 by Midnight
– Two-page minimum (excluding appendix, but including tables, equations, relevant plots, etc.)
– Should include:
∗ Intro – What are you doing and why? Include discussion of data.
∗ Methods and model – What model are you using, what assumptions are you making, how did
you select your model, etc.
∗ Results – Parameter estimates for your model, things like R², etc. What do the results mean?
How are they interpreted in the context of the problem? What information have you gained?
∗ Conclusion – Summarize everything briefly. Mention any shortcomings or anything that could
be improved upon in the future if you had more or better data.
∗ Appendix – Things you should do that don’t belong in the paper. Anything cool that you
want credit for but that doesn’t necessarily belong in the paper. For example, residual plots for
diagnostics, any exploratory plots, etc. Any difficult R code you’re happy with.
• Reproducibility – 30%
– You should submit an R file along with your paper
– I should be able to see exactly what you did. Did you create new variables? Did you delete certain
– I should be able to run it with no problems.
– I should be able to sub in a new data set and click “run”
• Modeling – 20%
– Did you arrive at a reasonable model?
– Did you check the assumptions?
– Did you compare various models?
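As a rough illustration of the modeling checklist above, the sketch below fits two candidate multiple linear regression models, compares them, and draws the standard diagnostic plots. It runs on simulated data: every column name except Onboard.Amt (Age, Nights, Cabin.Type) is a made-up stand-in, and the comparison tools shown (partial F-test, AIC, adjusted R²) are only some of the reasonable options, not a required procedure.

```r
# Simulated stand-in for the cruise data; every column except Onboard.Amt
# is a hypothetical name chosen for illustration.
set.seed(1)
n <- 200
sim <- data.frame(
  Age        = round(runif(n, 20, 80)),
  Nights     = sample(3:14, n, replace = TRUE),
  Cabin.Type = factor(sample(c("Inside", "Balcony", "Suite"), n, replace = TRUE))
)
sim$Onboard.Amt <- 50 + 2 * sim$Age + 30 * sim$Nights +
  100 * (sim$Cabin.Type == "Suite") + rnorm(n, sd = 60)

# Two candidate multiple linear regression models (nested)
fit_small <- lm(Onboard.Amt ~ Age + Nights, data = sim)
fit_full  <- lm(Onboard.Amt ~ Age + Nights + Cabin.Type, data = sim)

# Parameter estimates, standard errors, and R-squared
summary(fit_full)

# Compare the models: partial F-test, AIC, and adjusted R-squared
anova(fit_small, fit_full)
AIC(fit_small, fit_full)
adj_r2 <- c(small = summary(fit_small)$adj.r.squared,
            full  = summary(fit_full)$adj.r.squared)
adj_r2

# Diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit_full)
par(mfrow = c(1, 1))
```

The diagnostic plots are the kind of output that fits naturally in the appendix, while the parameter estimates and R² belong in the results section.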
Just for Fun
• Using a separate dataset of 1,305 passengers on an upcoming cruise:
1. Forecast total onboard revenue using your final model
– Sum of fitted values for every passenger
2. Guess which 50 passengers will spend the most using your model
– Sum actual spending for the 50 passengers with highest expected spending under your model
• Winner of each will receive a prize (TBD)
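One possible way to produce both entries is sketched below, again on simulated data. The predictor names (Age, Nights) and the object names train, new_data, and final_model are assumptions for illustration; the real files will have different columns. For passengers who were not used to fit the model, the "fitted values" come from predict() with newdata.

```r
set.seed(2)
# Simulated training data; column names other than Onboard.Amt are hypothetical.
train <- data.frame(Age    = round(runif(300, 20, 80)),
                    Nights = sample(3:14, 300, replace = TRUE))
train$Onboard.Amt <- 50 + 2 * train$Age + 30 * train$Nights +
  rnorm(300, sd = 60)

# Simulated stand-in for the 1,305 upcoming passengers (no response column)
new_data <- data.frame(Age    = round(runif(1305, 20, 80)),
                       Nights = sample(3:14, 1305, replace = TRUE))

final_model <- lm(Onboard.Amt ~ Age + Nights, data = train)

# 1. Forecast of total onboard revenue: sum of predicted spending
pred <- predict(final_model, newdata = new_data)
total_forecast <- sum(pred)
total_forecast

# 2. Row indices of the 50 passengers with the highest expected spending
top50 <- order(pred, decreasing = TRUE)[1:50]
head(new_data[top50, ])
```

If the final model uses a transformed response (for example a log), the predictions would need to be converted back to dollars before summing.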
The following code will read the dataset in and ensure proper variable formats.
classes <- c('numeric', 'Date', 'character', 'numeric', 'character', 'numeric', 'numeric',
             'character', 'character', 'character', 'character', 'character', 'numeric', 'character',
             'character', 'character', 'numeric', 'character', 'character', 'numeric', 'numeric')
data <- read.csv('CHANGE THIS/data.csv', colClasses=classes)
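After reading the file, it can be worth confirming that each column came in with the intended type. The sketch below shows the idea on a tiny temporary CSV; the three-column layout and the names Sail.Date and Cabin.Type are invented for illustration and do not reflect the real file.

```r
# Tiny stand-in CSV; column names other than Onboard.Amt are hypothetical.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Onboard.Amt,Sail.Date,Cabin.Type",
             "120.50,2016-03-21,Inside",
             "310.00,2016-03-26,Suite"), tmp)

demo_classes <- c("numeric", "Date", "character")
demo <- read.csv(tmp, colClasses = demo_classes)

# Confirm each column was read with the intended type
str(demo)
col_types <- sapply(demo, class)
col_types

# lm() treats character columns as categorical, but an explicit factor
# makes the levels (and the baseline level) visible
demo$Cabin.Type <- factor(demo$Cabin.Type)
levels(demo$Cabin.Type)
```

Keeping the file path in a single line near the top of the script, as in the read.csv call above, also makes it easier to swap in a new data set later.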
• I encourage you to think of additional variables you might create. I will help you with the R code if
you come up with something creative but don’t know how to program it (though I encourage you to
Google it first; almost everything you want to do in R will have a simple solution online). Example: “Hey, I see
there’s a variable that provides the date. I’d like to extract the day of the week out of that because I
think it might be a useful predictor. Can you help me with that?”
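For the day-of-the-week example, a minimal sketch might look like the following. The column name Sail.Date is hypothetical, and the derived variables Day.Num, Day.Of.Week, and Weekend are just one way to encode the idea. It relies on the column already being of class Date.

```r
# Hypothetical date column; the real dataset's column name may differ.
d <- data.frame(Sail.Date = as.Date(c("2016-03-21", "2016-03-26", "2016-03-30")))

# ISO weekday number (1 = Monday, ..., 7 = Sunday); unlike weekdays(),
# this does not depend on the machine's locale
d$Day.Num <- as.integer(format(d$Sail.Date, "%u"))

# Categorical predictor with readable labels
d$Day.Of.Week <- factor(d$Day.Num, levels = 1:7,
                        labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# A coarser alternative: weekend indicator
d$Weekend <- d$Day.Num >= 6

d
```

The factor can then go straight into an lm() formula, where R expands it into indicator variables with Monday as the baseline level.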