**Note: You need to submit your answers in a word documents. You need to copy the results that you need to answer the questions from the excel file into the word document. In addition, you must submit your Excel files but note that only the word document will be marked. **

**SECTION A: Discussion Questions**

** **

- Explain the concept of having the imbalance data in classification techniques and the way that it should be treated in developing the classification models?
- Explain the concept of over-fitting. Explain how overfitting can be avoided?
- Give two examples of how logistics regression can be used. You only need to explain the problem. One example is the bank that are using logistic regression to classify its new customers for loan approval. The bank wanted to identify customers that are more likely to default on their loan. Explain why you cannot use linear regression in your examples.

(5+4+6 = 15 marks)

**SECTION B: QUANTITATIVE QUESTIONS**

- There are 500 client records in the file Toy-Info which have shopped many special toys from an e-Business website. Each record includes data on types of product purchased (between 1-5), purchase amount ($), age, gender, marital status, whether the client has a membership and whether the customer has a discount card.

Apply k-means clustering on all six variables. Recommend a proper value of k and explain how you found this. Describe these k clusters by their average characteristics. (Hint: Compare the average distance of observations in each cluster with the distance between this cluster and the others)

(8 marks)

- In order to improve the overall quality of a new material, a chemist experiments with the effect of two indices (A & B) on each other in her laboratory. In the following table, the values of these indices have been captured for each experiment:

No. of experiment |
Index A |
Index B |

1 | 248 | 29915 |

2 | 247 | 29915 |

3 | 247 | 29991 |

4 | 253 | 29807 |

5 | 251 | 29965 |

6 | 230 | 29620 |

7 | 232 | 29526 |

8 | 237 | 29383 |

9 | 233 | 29345 |

10 | 242 | 29711 |

11 | 242 | 29570 |

12 | 245 | 29822 |

- Plot a scatter chart for this data where index A is the independent variable. What does the scatter chart indicate about the relationship between indices A and B? How strong is the relationship? Create an estimated simple regression model and write the equation?
- Apply and investigatea polynomial regression model that includes intercept and terms x, x^2 and x^3. Is this new model statistically significant?Is this new model better than the linear model, explain?

(8+7 = 15 marks)

- The following data is the results of a 4- year study conducted to assess how age, weight, and gender influence the risk of diabetes. Risk is interpreted as the probability (times 100) that the patient will have diabetes over the next 4-year period.

- Develop a multiple regression model that relates risk of diabetes to the person’s age, weight and the gender. Present the regression formula as a mathematical equation. Interpret the coefficients of the regression and comment on the strength of the regression.
- Develop an estimated multiple regression model that relates risk of diabetes to the person’s age, weight, gender and life style. Present the regression formula as a mathematical equation. Interpret the coefficients of the regression and comment on the strength of the regression.
- What is the risk percentage of diabetes over the next 4 years for a 55-year-old man living in a big city with 70 kg weight?

Age |
Weight (Kg) |
Gender |
Life style |
Risk (%) |

53 | 78 | Female | Small town | 40 |

24 | 77 | Male | Big city | 23 |

77 | 83 | Female | Country | 67 |

88 | 89 | Female | Small town | 71 |

56 | 65 | Male | Big city | 45 |

71 | 82 | Female | Country | 54 |

53 | 79 | Female | Small town | 48 |

70 | 66 | Male | Small town | 49 |

80 | 80 | Female | Big city | 65 |

78 | 67 | Male | Big city | 59 |

71 | 69 | Male | Big city | 56 |

70 | 78 | Female | Small town | 59 |

67 | 75 | Male | Country | 46 |

77 | 95 | Female | Big city | 64 |

60 | 57 | Male | Country | 39 |

82 | 100 | Female | Big city | 73 |

66 | 85 | Male | Small town | 63 |

80 | 96 | Male | Big city | 87 |

62 | 83 | Female | Country | 52 |

59 | 93 | Male | Big city | 61 |

(5 +5+4= 14 marks)

- An internet provider company in Australia is interested in identifying the reason for individuals who are still undecided in buying the new NBN service of the company. The file NBN-service contains data on a sample of customers with variables that tracked the decision outcome.

Create a standard partition of the data with all tracked variables and 40% of observations in the training set, 35% in the validation set, and 25% in the test set. Construct a logistic regression model to classify undecided customers of the company. Use Undecided as the output variable and Contract duration, Bonus data and Usageas the input variables. (cut-off value equals to 0.5 and success is 1)

- Write the obtained logistic regression equation and predict a customer with Contract duration of 16 months, Bonus data of 63 GB and Usage of 237 GB whether he/she will decide to buy the new service or not? Explain how you found the prediction.
- Comment on the accuracy of the model. Report the class 1 and class 0 error and explain which kind of these errors are more undesirable in this example?
- AddLast plan variable to the model and create another logistic regression model. Compare the accuracy of this new model with the model obtained in previous part. Which one do you recommend?

(7+5+8 = 20 marks)

- A put option in finance allows you to sell a share of stock in the future at a given price. There are different types of put options. A European put option allows you to sell a share of stock at a given price (called the exercise price) at a particular point in time after the purchase of the option. For example, suppose you purchase an eight-month European put option for a share of stock with an exercise price of $29. If eight months later, the stock price per share is $29 or more, the option has no value. If in six months timethe stock price is lower than $29 per share, then you can purchase the stock and immediately sell it at the higher exercise price of $29. If the price per share in eight months is $26.4, you can purchase a share of the stock for $26.4 and then use the put option to immediately sell the share for $29. Your profit would be the difference, $29-$26.4 = $2.6 per share, less the cost of the option. If you paid $1.5 per put option, then your profit would be $2.6-$1.5=$1.1 per share.

- Build a model to calculate the profit of this European put option.
- Construct a data table that shows the profit per share for a share price in eight months between $15 and $35 per share in increments of $1.

(7+7 = 14 marks)

- FSUB is a companythat intended to introduce its product by advertising them in 3 relevant websites. The names of these websites are determined as A, B and C by the marketing manager of the company. Viewer estimates, and costs per advertisement are as shown:

Limitations |
Website A |
Website B |
Website C |

Viewer per advertisement | 120,000 | 25,000 | 60,000 |

Cost per advertisement | $2,500 | $600 | $800 |

To ensure a balanced use of advertising, website B advertisements must not exceed 60% of the total number of advertisements authorized. In addition, website A should account for at least 20% of the total number of advertisements authorized.

- If the budget is limited to $65,900, how many advertisements should be ordered on each website to maximize total viewers? What is the allocation of the budget among the three websites, and what is the total viewers reached?
- Rewrite the model and answer this question: what would be the cost per advertisement of website A such that exactly 15 advertisements be allocated to website B if the company aim to maximize the total number of viewers? (Assume the cost per advertisements of website A and website C remain same as presented in the above table).

(8+6=14 marks)

** **

** TOTAL MARKS= 100**