Part 1 Point Estimation

Question 1

You have an urn that contains some red balls and blue balls, but we don’t know how many of each, nor the relative ratio of red to blue balls. n balls are sampled with replacement, among which there are k blue balls. Formulate an expression for the probability of getting k blue balls in the sample, and use the concept of maximum likelihood estimation to derive a formula for the estimate of p, the ratio of the number of red balls divided by the number of blue balls in the urn.

Answer 1

i) Expression of the probability of getting k blue balls in the drawn sample

Let r and b be the unknown numbers of red and blue balls present in the urn, respectively.

Drawing k blue balls out of the urn after n trials with replacement is a binomial experiment, each trial being the Bernoulli trial of drawing a blue ball or not. The probability of success of each Bernoulli trial is: \(p_{s}=\frac{b}{r+b}=\theta\)

Let \((X_{i})_{i \in \{1,2, ..., n \}}=\begin{cases} 1 & if\;the\;i^{th}\;chosen\;ball\;is\;blue \\ 0 & if\;the\;i^{th}\;chosen\;ball\;is\;red \end{cases}\) be n i.i.d. random variables; \((X_{i})_{i \in \{1,2, ..., n \}} \sim Bernoulli(p_{s})\)

Let X be the random variable counting the number of blue balls among the n draws (our binomial experiment). Hence our probability formula is:

\[\boxed{P(X=k) = {n \choose k}(p_{s})^{k}(1-p_{s})^{n-k}}\]

ii) Using the concept of MLE, let’s derive the formula for the estimate of the ratio of the number of red balls divided by the number of blue balls in the urn, p:

Here, we look at the maximum likelihood estimate of the ratio \(\frac{r}{b}\) via the Bernoulli probability parameter. Remember that the probability distribution for a Bernoulli distribution with probability \(p_{s}=\theta=\frac{b}{r+b}=\frac{1}{1+\frac{r}{b}}\) is: \[p(y\mid\frac{r}{b}) = (\theta)^{y}(1-\theta)^{1-y}\] where \(y \in \{0, 1\}\) is a binary variable. If Y comes from a Bernoulli distribution with probability \(\theta\), we can write \(Y \sim Be(\theta)\). Let \(y = (y_{1} ... y_{n})\) be a vector of n binary random variables (0 or 1). We are interested in trying to estimate the unknown \(\frac{r}{b}\) from our observations y using maximum likelihood.

ii-1) Likelihood

First, write down the likelihood of the sequence y if each \(y_{i}\) is modeled using an independent Bernoulli distribution with the same probability \(\frac{1}{1+\frac{r}{b}}\), i.e., what is: \[p(y\mid\frac{r}{b}) = \prod_{i=1}^{n}p(y_{i}\mid\frac{r}{b})\] equal to?

Answer: The likelihood is equal to the product of the individual probabilities of the Bernoulli RVs, i.e., \[ \begin{align} p(y\mid\frac{r}{b}) & =(\frac{1}{1+\frac{r}{b}})^{y_{1}}(1-\frac{1}{1+\frac{r}{b}})^{1-y_{1}}.(\frac{1}{1+\frac{r}{b}})^{y_{2}}(1-\frac{1}{1+\frac{r}{b}})^{1-y_{2}} ...(\frac{1}{1+\frac{r}{b}})^{y_{n}}(1-\frac{1}{1+\frac{r}{b}})^{1-y_{n}} \\ & = (\frac{1}{1+\frac{r}{b}})^{k}(1-\frac{1}{1+\frac{r}{b}})^{n-k} \\ & = (\frac{1}{1+\frac{r}{b}})^{k}(\frac{\frac{r}{b}}{1+\frac{r}{b}})^{n-k} \end{align} \] where \(k=\sum_{i=1}^{n}y_{i}\) is the number of successes in the n trials (number of blue balls drawn), so n-k is the number of failures (number of red balls drawn).

ii-2) Negative log-likelihood

Next, let’s take the negative logarithm (using the natural logarithm ln()) of the likelihood to find the negative log-likelihood and simplify it if possible:

\[ \begin{align} l(y; \frac{r}{b}) & = -lnp(y\mid\frac{r}{b})\\ & = -ln[ (\frac{1}{1+\frac{r}{b}})^{k}(\frac{\frac{r}{b}}{1+\frac{r}{b}})^{n-k}] \\ &= -klog(\frac{1}{1+\frac{r}{b}})-(n-k)log(\frac{\frac{r}{b}}{1+\frac{r}{b}}) \end{align} \] where we use the fact that : \((i) log(ab)=loga+logb \; and \; (ii)log(a^{n})=nloga\).

ii-3) Find the MLE

Now, given the negative log-likelihood, we want to find the maximum likelihood estimator of the ratio \(\frac{r}{b}\) for a given data sample y, which we will call \(\hat{\frac{r}{b}_{ML}(y)}\). To do this,

\(\alpha)\) Let’s differentiate our negative log-likelihood function with respect to \(\frac{r}{b}\)

\(\beta)\) Let’s set the derivative to zero and

\(\gamma)\) solve for \(\frac{r}{b}\).

Answer:

Let’s begin by differentiating the negative log-likelihood with respect to our parameter of interest \(\frac{r}{b}\):

\[ \begin{align} \frac{dl(y; \frac{r}{b})}{d\frac{r}{b}}&=\frac{d[-klog(\frac{1}{1+\frac{r}{b}})-(n-k)log(\frac{\frac{r}{b}}{1+\frac{r}{b}})]}{d\frac{r}{b}}\\ &= \frac{d}{d\frac{r}{b}}[-klog(\frac{1}{1+\theta})-(n-k)log(\frac{\theta}{1+\theta})], with \; \theta = \frac{r}{b} \\ &= -k[-\frac{1}{(1+\theta)^{2}}\times(1+\theta)]-(n-k)[\frac{(1+\theta)-\theta(1)}{(1+\theta)^{2}}\times\frac{1+\theta}{\theta}]\\ &=-k[-\frac{1}{(1+\theta)^{2}}\times(1+\theta)]-(n-k)[\frac{1}{\theta(1+\theta)}]\\ &=\frac{k}{1+\theta}-\frac{n-k}{\theta(1+\theta)}\\ &=\frac{k\theta-n+k}{\theta(1+\theta)}\\ &=\frac{k(\theta+1)-n}{\theta(1+\theta)}\\ &=\frac{k}{\theta}-\frac{n}{\theta(1+\theta)} \end{align} \]

Then we need to set this to zero and solve for \(\theta\):

\[ \begin{align} \frac{k}{\theta}-\frac{n}{\theta(1+\theta)}&=0\\ \implies \frac{k(\theta+1)-n}{\theta(1+\theta)}&=0\\ \implies k(\theta+1)-n &=0\\ \end{align} \]

so that the ML estimator for the ratio of the number of red balls divided by the number of blue balls in the urn, \(\frac{r}{b}\) is: \[\boxed{\hat{\frac{r}{b}_{ML}(y)}=\frac{n}{k}-1}\].

which is the reciprocal of the fraction of successes in the total number of trials, minus 1.
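
As a quick sanity check, here is a minimal R sketch (assuming a hypothetical urn with r = 30 red and b = 10 blue balls, so the true ratio is p = 3) that simulates the sampling with replacement and applies the estimator \(\hat{p}=\frac{n}{k}-1\):

# Hypothetical urn: r = 30 red, b = 10 blue, so the true ratio p = r/b = 3
set.seed(1)
r <- 30; b <- 10; n <- 1e5
draws <- rbinom(n, 1, b / (r + b))  # 1 = blue ball drawn, 0 = red
k <- sum(draws)                     # number of blue balls in the sample
n / k - 1                           # MLE of r/b derived above; should be close to 3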

Question 2

Suppose that we know that a random variable X follows the distribution given below:

\[PDF(X=x\mid\theta)=\frac{{2 \choose x}\theta^{x}(1-\theta)^{2-x}}{1-(1-\theta)^{2}}, \quad x \in \{1, 2\}\]

Imagine that we observe a sample of n i.i.d. random variables \(x=(x_{1}, ..., x_{n})\) and want to model them using this distribution. Please use the concept of maximum likelihood to estimate the parameter \(\theta\).

Answer 2

Here, we look at maximum likelihood estimate of the probability parameter \(\theta\) in the given distribution. We are interested in trying to estimate the unknown \(\theta\) from our observations x using maximum likelihood.

1) Likelihood

First, write down the likelihood of the sequence x if each \(x_{i}\) is modeled independently using the given distribution. For concreteness, we work below with a sample of size n = 2 with observed values \(x_{1}=1\) and \(x_{2}=2\), i.e., what is: \[p(x\mid\theta) = \prod_{i=1}^{2}p(x_{i}\mid\theta)\] equal to?

Answer: The likelihood is equal to the product of the individual probabilities, i.e., \[ \begin{align} p(x\mid\theta) & = \frac{{2 \choose 1}\theta^{1}(1-\theta)^{2-1}}{1-(1-\theta)^{2}}\times\frac{{2 \choose 2}\theta^{2}(1-\theta)^{2-2}}{1-(1-\theta)^{2}} \\ & = \frac{2\theta(1-\theta)}{\theta(2-\theta)}\times\frac{\theta^{2}}{\theta(2-\theta)} \\ & = \frac{2(1-\theta)}{2-\theta}\times\frac{\theta}{2-\theta} \\ & = \frac{2\theta(1-\theta)}{(2-\theta)^{2}} \\ \end{align} \] where we used \(1-(1-\theta)^{2}=\theta(2-\theta)\).

2) Negative log-likelihood

Next, let’s take the negative logarithm (using the natural logarithm ln()) of the likelihood to find the negative log-likelihood and simplify it if possible:

\[ \begin{align} l(x; \theta) & = -lnp(x\mid\theta)\\ & = -ln[ \frac{2\theta(1-\theta)}{(2-\theta)^{2}}] \\ &= -log(2)-log(\theta)-log(1-\theta)+2log(2-\theta) \end{align} \] where we use the fact that : \((i) log(\frac{a}{b})=log(a)-log(b) \;(ii) log(ab)=loga+logb \; and \; (iii)log(a^{n})=nloga\).

3) Find the MLE

Now, given the negative log-likelihood, we want to find the maximum likelihood estimator of the parameter \(\theta\) for a given data sample x, which we will call \(\hat{\theta_{ML}(x)}\). To do this,

\(\alpha)\) Let’s differentiate our negative log-likelihood function with respect to \(\theta\)

\(\beta)\) Let’s set the derivative to zero and

\(\gamma)\) solve for \(\theta\).

Answer:

Let’s begin by differentiating the negative log-likelihood with respect to our parameter of interest \(\theta\):

\[ \begin{align} \frac{dl(x; \theta)}{d\theta}&=\frac{d[-log(2)-log(\theta)-log(1-\theta)+2log(2-\theta)]}{d\theta}\\ &= -\frac{1}{\theta}+\frac{1}{1-\theta}-\frac{2}{2-\theta}\\ &= -\frac{2-3\theta}{\theta(1-\theta)(2-\theta)} \end{align} \]

Then we need to set this to zero and solve for \(\theta\):

\[ \begin{align} -\frac{2-3\theta}{\theta(1-\theta)(2-\theta)}&=0\\ \implies 2-3\theta&=0\\ \end{align} \]

so that the ML estimator for the parameter \(\theta\) is: \[\boxed{\hat{\theta_{ML}(x)}=\frac{2}{3}}\].
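
As a sanity check, a minimal R sketch (assuming the illustrative sample x = (1, 2) used above) that minimizes the negative log-likelihood numerically:

# Negative log-likelihood of the sample x = (1, 2) under the given distribution
nll <- function(theta) {
  p <- function(x) choose(2, x) * theta^x * (1 - theta)^(2 - x) / (1 - (1 - theta)^2)
  -log(p(1) * p(2))
}
optimize(nll, interval = c(0.01, 0.99))$minimum  # should be close to 2/3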

Question 3

Let \(x=(x_{1}, x_{2}, ... , x_{n})\) be a sample obtained from a Bernoulli distribution, i.e., \(x_{i} \sim Be(\theta)\) and \(x_{i} \in \{0, 1\}\). Please answer the following questions:

  1. Derive the MLE estimator for \(\theta\), i.e., \(\hat{\theta_{MLE}}\), and show that it is unbiased.

  2. Find an unbiased estimator for \(\theta^{2}\).

Answer 3

a)

i) MLE estimator for \(\theta\):

Here, we look at the maximum likelihood estimate of the probability parameter \(\theta\) in the Bernoulli distribution. Remember that the probability distribution for a Bernoulli distribution with probability \(\theta\) is: \[p(x\mid\theta) = (\theta)^{x}(1-\theta)^{1-x}\] where \(x \in \{0, 1\}\) is a binary variable. If X comes from a Bernoulli distribution with probability \(\theta\), we can write \(X \sim Be(\theta)\). Let \(x = (x_{1} ... x_{n})\) be a vector of n binary random variables (0 or 1). We are interested in trying to estimate the unknown probability \(\theta\) from our observations x using maximum likelihood.

i-1) Likelihood

First, write down the likelihood of the sequence x if each \(x_{i}\) is modeled using an independent Bernoulli distribution with the same probability \(\theta\), i.e., what is: \[p(x\mid\theta) = \prod_{i=1}^{n}p(x_{i}\mid\theta)\] equal to?

Answer: The likelihood is equal to the product of the individual probabilities of the Bernoulli RVs, i.e., \[ \begin{align} p(x\mid\theta) & =(\theta)^{x_{1}}(1-\theta)^{1-x_{1}}.(\theta)^{x_{2}}(1-\theta)^{1-x_{2}} ...(\theta)^{x_{n}}(1-\theta)^{1-x_{n}} \\ & = (\theta)^{k}(1-\theta)^{n-k} \\ \end{align} \] where \(k=\sum_{i=1}^{n}x_{i}\) is the number of successes in the n trials, so n-k is the number of failures.

i-2) Negative log-likelihood

Next, let’s take the negative logarithm (using the natural logarithm ln()) of the likelihood to find the negative log-likelihood and simplify it if possible:

\[ \begin{align} l(x; \theta) & = -lnp(x\mid\theta)\\ & = -ln[ (\theta)^{k}(1-\theta)^{n-k}] \\ &= -klog(\theta)-(n-k)log(1-\theta) \end{align} \] where we use the fact that : \((i) log(ab)=loga+logb \; and \; (ii)log(a^{n})=nloga\).

i-3) Find the MLE

Now, given the negative log-likelihood, we want to find the maximum likelihood estimator of the probability \(\theta\) for a given data sample x, which we will call \(\hat{\theta_{ML}(x)}\). To do this,

\(\alpha)\) Let’s differentiate our negative log-likelihood function with respect to \(\theta\)

\(\beta)\) Let’s set the derivative to zero and

\(\gamma)\) solve for \(\theta\).

Answer:

Let’s begin by differentiating the negative log-likelihood with respect to our parameter of interest \(\theta\):

\[ \begin{align} \frac{dl(x; \theta)}{d\theta}&=\frac{d[-klog(\theta)-(n-k)log(1-\theta)]}{d\theta}\\ &= -\frac{k}{\theta}+\frac{n-k}{1-\theta}\\ &=\frac{-k+n\theta}{\theta(1-\theta)} \end{align} \]

Then we need to set this to zero and solve for \(\theta\):

\[ \begin{align} \frac{-k+n\theta}{\theta(1-\theta)}&=0\\ \implies -k+n\theta&=0\\ \end{align} \]

so that the ML estimator for the probability of the Bernoulli distribution, \(\theta\) is: \[\boxed{\hat{\theta_{ML}(x)}=\frac{k}{n}}\].

which is the fraction of successes in the total number of trials.

ii) Let’s show that the MLE estimator for \(\theta\) is unbiased:

First we note that the expected value of \(\hat{\theta_{ML}(x)}\) is: \[\mathbb{E}[\hat{\theta_{ML}(x)}]=\mathbb{E}[\frac{k}{n}]=\mathbb{E}[\frac{\sum_{i=1}^{n}X_{i}}{n}]=\frac{1}{n}\mathbb{E}[\sum_{i=1}^{n}X_{i}]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_{i}]=\frac{1}{n}n\theta=\theta\]. Therefore the bias of \(\hat{\theta_{ML}(x)}\) is:

\[\boxed{Bias(\hat{\theta_{ML}(x)})=\mathbb{E}[\hat{\theta_{ML}(x)}]-\theta=\theta-\theta=0}\].

Thus, the MLE estimator for \(\theta\) is unbiased.

b) Let’s find an unbiased estimator for \(\theta^{2}\).

What if we simply took as our estimator for \(\theta^{2}\): \((\hat{\theta})^{2}=(\frac{1}{n}\sum_{i=1}^{n}X_{i})^{2}\)?

The expected value of \((\hat{\theta})^{2}\) is:

\(E[(\frac{1}{n}\sum_{i=1}^{n}X_{i})^{2}]=\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}E[X_{i}X_{j}].\)

When \(i\neq j\), \(E[X_{i}X_{j}]=E[X_{i}]E[X_{j}]\) because they are independent; but when \(i = j\), we have \(X_{i}X_{j}=X_{i}^{2}\),

hence

\(E[X_{i}^{2}]=Var[X_{i}]+(E[X_{i}])^{2}=\theta(1-\theta)+\theta^{2}=\theta\)

Since in our double sum we have n instances where i = j and \(n^{2}\) terms in total, it follows that

\(E[(\hat{\theta})^{2}]=\frac{1}{n^{2}}((n^{2}-n)\theta^{2}+n\theta)=\frac{\theta(1-\theta)}{n}+\theta^{2}\).

So the bias here is \(\frac{\theta(1-\theta)}{n}\). To rid ourselves of it, we should collect like terms in \(\theta\) to get

\[E[(\hat{\theta})^{2}]=\frac{\theta}{n}+\frac{n-1}{n}\theta^{2}\]

from which we see from linearity of expectation that:

\[ \begin{align} E[(\hat{\theta})^{2}-\frac{\hat{\theta}}{n}]&=E[(\hat{\theta})^{2}]-\frac{1}{n}E[\hat{\theta}]\\ &=(\frac{\theta}{n}+\frac{n-1}{n}\theta^{2})-(\frac{\theta}{n})\\ &=\frac{n-1}{n}\theta^{2}\\ \implies \frac{n}{n-1}E[(\hat{\theta})^{2}-\frac{\hat{\theta}}{n}]&=\theta^{2}\\ \implies E[\frac{n}{n-1}((\hat{\theta})^{2}-\frac{\hat{\theta}}{n})]&=\theta^{2}\\ \implies E[\frac{n(\hat{\theta})^{2}}{n-1}-\frac{\hat{\theta}}{n-1}]&=\theta^{2} \end{align} \]

Therefore an unbiased estimator for \(\theta^{2}\) is:

\[ \begin{align} \hat{\theta^{2}}&=\frac{n(\hat{\theta})^{2}}{n-1}-\frac{\hat{\theta}}{n-1}\\ &=\frac{n(\frac{k}{n})^{2}}{n-1}-\frac{\frac{k}{n}}{n-1}\\ &=\frac{\frac{k^{2}}{n}}{n-1}-\frac{\frac{k}{n}}{n-1}\\ &=\frac{k^{2}-k}{n(n-1)}\\ \end{align} \] Hence \[\boxed{\hat{\theta^{2}}=\frac{k^{2}-k}{n(n-1)}}, with \; k = \sum_{i=1}^{n}X_{i}\]
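
A minimal R sketch (with the arbitrary choices \(\theta = 0.3\) and n = 20) that checks both estimators empirically over many simulated samples:

set.seed(2)
theta <- 0.3; n <- 20; reps <- 1e5
k <- rbinom(reps, n, theta)          # number of successes in each simulated sample
mean(k / n)                          # should be close to theta   = 0.3
mean((k^2 - k) / (n * (n - 1)))      # should be close to theta^2 = 0.09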

Part 2 Confidence Interval Estimation & Central Limit Theorem

Question 1

We randomly interview students about their attitude towards the SAT survey and find that only 8 students, out of a sample of size 80, would refrain from filling in the SAT survey. Assuming that the students’ attitudes collected in the interview truly reflect the final SAT participation rate, please construct the two-sided 90% confidence interval for the SAT participation rate.

Answer 1

Let p and \(\hat{p}\) be, respectively, the SAT participation rate and its MLE estimate.

As there are more than 5 successes and more than 5 failures, the confidence interval can be computed using the formula:

\[\hat{p}\pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

The point estimate for the population proportion is the sample proportion, and the margin of error is the product of the Z value for the desired confidence level (e.g., Z=1.645 for 90% confidence) and the standard error of the point estimate. In other words, the standard error of the point estimate is:

\[SE(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\] This formula is appropriate for samples with at least 5 successes and at least 5 failures in the sample. This was a condition for the Central Limit Theorem for binomial outcomes. If there are fewer than 5 successes or failures then alternative procedures, called exact methods, must be used to estimate the population proportion.

In our case, we have:

\[\hat{p}=\frac{80-8}{80}=\frac{72}{80}=\frac{9}{10}=0.9\] By substituting our values in the confidence interval we get:

\[0.9\pm 1.645\sqrt{\frac{0.9(1-0.9)}{80}}=0.9 \pm 1.645(0.03354)\]

which is

\[0.9\pm 0.0552\]

So, the 90% interval is (0.845, 0.955)

Thus we are 90% confident that the true proportion of students taking part in the SAT survey is between 84.5% and 95.5%.
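
A minimal R sketch reproducing this interval:

p.hat <- 72 / 80                      # sample proportion
se <- sqrt(p.hat * (1 - p.hat) / 80)  # standard error of the point estimate
z <- qnorm(0.95)                      # 1.645 for a two-sided 90% CI
p.hat + c(-1, 1) * z * se             # approximately (0.845, 0.955)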

Question 2

2020 was the first time that our teaching team delivered all statistics teaching activities online and we are curious to know how the performance of students can be compared with that of semesters when on-campus teaching was the sole mode of delivery.

Therefore, we randomly pick 10 students from year 2019, and another 15 students from year 2020 to investigate. Assuming that the student’s score is normally distributed, i.e. the student’s score in 2019 is denoted as \(X \sim \mathcal{N}(\mu_{1}, \sigma_{1}^{2})\), and the student’s score in 2020 is denoted as \(Y \sim \mathcal{N}(\mu_{2}, \sigma_{2}^{2})\).

From the collected samples, we have \(\bar{x}=82\) , \(s_{x}^{2}= 56.5\), and \(\bar{y}=76\) , \(s_{y}^{2}= 52.4\) for 2019 and 2020 respectively. Using such information, please try your best in answering the following questions:

  1. If we know that \(\sigma_{1}^{2} = 64\), \(\sigma_{2}^{2} = 49\), please find the 95% confidence interval for \(\mu_{1}-\mu_{2}\)

  2. If we only know that \(\sigma_{1}^{2} =\sigma_{2}^{2}\), please find the 95% confidence interval for \(\mu_{1}-\mu_{2}\)

  3. If we are not sure about \(\sigma_{1}\) and \(\sigma_{2}\), please find the 95% confidence interval for \(\mu_{1}-\mu_{2}\)

Answer 2

(1) If we know that \(\sigma_{1}^{2} = 64\), \(\sigma_{2}^{2} = 49\), let’s find the 95% confidence interval for \(\mu_{1}-\mu_{2}\)

Since the scores are normally distributed and both population variances are known, \(\frac{(\bar{x_{1}}-\bar{x_{2}})-(\mu_{1}-\mu_{2})}{\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}} \sim \mathcal{N}(0, 1)\), so the confidence interval for \(\mu_{1}-\mu_{2}\) is defined by: \(\bar{x_{1}}-\bar{x_{2}}\pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}\).

We have for the point estimate:

\(\bar{x_{1}}-\bar{x_{2}}=\bar{x}-\bar{y}=82-76=6\).

The normal quantile is:

\(z_{1-\frac{\alpha}{2}}=z_{0.975}=1.96\)

The standard error is:

\(\sqrt{\frac{\sigma_{1}^{2}}{n_{1}}+\frac{\sigma_{2}^{2}}{n_{2}}}=\sqrt{\frac{64}{10}+\frac{49}{15}}=3.1091\)

Hence, the confidence interval for the difference of the mean student performances is:

\(6 \pm 1.96 \times 3.1091 = 6 \pm 6.0939\)

That is: \(IC_{95}(\mu_{1}-\mu_{2}) = [-0.0939\), \(12.0939]\) when \(\sigma_{1}^{2} = 64\), \(\sigma_{2}^{2} = 49\)

(2) If we only know that \(\sigma_{1}^{2} =\sigma_{2}^{2}\), let’s find the 95% confidence interval for \(\mu_{1}-\mu_{2}\)

The confidence interval for \(\mu_{1}-\mu_{2}\) is defined by: \(\bar{x_{1}}-\bar{x_{2}}\pm t_{1-\frac{\alpha}{2}, n_{1}+n_{2}-2}\,S_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}\), with the pooled variance \(S_{p}^{2}=\frac{(n_{1}-1)s_{x}^{2}+(n_{2}-1)s_{y}^{2}}{n_{1}+n_{2}-2}\).

We have for the point estimate:

\(\bar{x_{1}}-\bar{x_{2}}=\bar{x}-\bar{y}=82-76=6\).

Let’s determine the degrees of freedom:

\(n_{1}+n_{2}-2 = 10+15-2=23\).

And the Student statistic value read from the relevant Student table is:

\(t_{1-\frac{\alpha}{2}, n_{1}+n_{2}-2}=t_{1-\frac{0.05}{2}, 23}=2.0687\)

The pooled variance is \(S_{p}^{2}=\frac{(10-1)56.5+(15-1)52.4}{10+15-2}=\frac{1242.1}{23}=54.0043\), so \(S_{p}=\sqrt{54.0043}=7.3488\), and the standard error is:

\(S_{p}\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}=7.3488\sqrt{\frac{1}{10}+\frac{1}{15}}=3.0001\)

Hence, the confidence interval for the difference of the mean student performances is:

\(6 \pm 2.0687 \times 3.0001 = 6 \pm 6.2066\)

That is: \(IC_{95}(\mu_{1}-\mu_{2}) = [-0.2066\), \(12.2066]\) when \(\sigma_{1}^{2} =\sigma_{2}^{2}\)

(3) If we are not sure about \(\sigma_{1}\) and \(\sigma_{2}\), let’s find the 95% confidence interval for \(\mu_{1}-\mu_{2}\)

The confidence interval for \(\mu_{1}-\mu_{2}\) is defined by: \(\bar{x_{1}}-\bar{x_{2}}\pm t_{1-\frac{\alpha}{2}, \nu}\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}\), with the Welch-Satterthwaite degrees of freedom \(\nu = \frac{(\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}})^{2}}{\frac{s_{1}^{4}}{n_{1}^{2}(n_{1}-1)}+\frac{s_{2}^{4}}{n_{2}^{2}(n_{2}-1)}}\).

We have for the point estimate:

\(\bar{x_{1}}-\bar{x_{2}}=\bar{x}-\bar{y}=82-76=6\).

Let’s determine the degrees of freedom \(\nu\), using the sample variances \(s_{1}^{2}=s_{x}^{2}=56.5\) and \(s_{2}^{2}=s_{y}^{2}=52.4\):

\(\nu = \frac{(\frac{56.5}{10}+\frac{52.4}{15})^{2}}{\frac{56.5^{2}}{10^{2}(10-1)}+\frac{52.4^{2}}{15^{2}(15-1)}}=\frac{(9.1433)^{2}}{3.5469+0.8717}=18.92\approx19\).

And the Student statistic value read from the relevant Student table is:

\(t_{1-\frac{\alpha}{2}, \nu}=t_{1-\frac{0.05}{2}, 19}=2.0930\)

The standard error is:

\(\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}=\sqrt{\frac{56.5}{10}+\frac{52.4}{15}}=3.0238\)

Hence, the confidence interval for the difference of the mean student performances is:

\(6 \pm 2.0930 \times 3.0238 = 6 \pm 6.3288\)

That is: \(IC_{95}(\mu_{1}-\mu_{2}) = [-0.3288\), \(12.3288]\) when \(\sigma_{1}^{2}\) and \(\sigma_{2}^{2}\) are unknown.
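
A minimal R sketch reproducing the three intervals from the summary statistics given above:

x.bar <- 82; y.bar <- 76; sx2 <- 56.5; sy2 <- 52.4; n1 <- 10; n2 <- 15
d <- x.bar - y.bar

# (1) Known variances: z interval
d + c(-1, 1) * qnorm(0.975) * sqrt(64 / n1 + 49 / n2)

# (2) Equal but unknown variances: pooled t interval
sp <- sqrt(((n1 - 1) * sx2 + (n2 - 1) * sy2) / (n1 + n2 - 2))
d + c(-1, 1) * qt(0.975, n1 + n2 - 2) * sp * sqrt(1 / n1 + 1 / n2)

# (3) Unknown, possibly unequal variances: Welch t interval
se <- sqrt(sx2 / n1 + sy2 / n2)
nu <- se^4 / (sx2^2 / (n1^2 * (n1 - 1)) + sy2^2 / (n2^2 * (n2 - 1)))
d + c(-1, 1) * qt(0.975, nu) * se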

Question 3

An Annotated/labeled data pool is one of the most important assets for a machine learner or data scientist. The more annotated data you can gather, the better your model estimations become towards the unknown variables (i.e. as we saw with the weak law of large numbers). As such, your team is going to label a new dataset for both future use and the improvement of your current model estimates. However, as this task is very arduous in nature, you want to budget for your time properly. You started by recording the amount of time it takes for you to label one record of data, and found that it took on average 5 minutes for you to do so. You also noted that there is a standard deviation of 5 minutes for the recording time as well. As this is a high priority task for your company, your boss has allowed you to spend 4 hours per day this week (only working days - Monday to Friday) to perform this annotation task. Please find out at most how many records would be labelled by you in a single week with a probability of at least 97.5% ?

Answer 3

Let \((X_{i})_{i=1, ... , 5}\) be the random variables associated with the number of records registered in the 4 hours of each day.

As we have \(4\;hours = 4 \times 60\;minutes = 240\;minutes\), we get \(X_{i} \sim Exp(240)\)

And \(S_{n} = \sum_{i=1}^{5}X_{i} \sim \gamma(5, \frac{1}{240})\) for the five days of the week.

This implies that \(\frac{S_{n}}{240} \sim \gamma(5)\) and \(\frac{2S_{n}}{240} \sim \chi_{2n}^{2}\), with n = 5 for the five working days of the week.

So we can find a and b such as:

\[P(a \lt \frac{2S_{n}}{\theta} \lt b) = 1- \alpha\], where \(\theta\) is the random variable associated with the number of records labelled in a week.

Then, by deducing the confidence interval of level \(1-\alpha\) from that:

\[\frac{2S_{n}}{b} \lt \theta \lt \frac{2S_{n}}{a}\] we got an estimate of the number of records that would be labeled in a single week.

By picking \(\alpha=0.05\), we read in the chi-squared table the quantiles of respective orders 0.025 and 0.975 of the \(\chi_{10}^{2}\) law: a=3.25 and b=20.48.
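
These quantiles can be checked in R:

qchisq(c(0.025, 0.975), df = 10)  # approximately 3.25 and 20.48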

Hence, the interval:

\[23.4375 \lt \theta \lt 147.6923\] And at most 148 records would be labelled in a week.

Question 4

In lecture 3, we mentioned the use of the weak law of large numbers which tells us that the sample estimator will converge to the population parameter if we have sufficiently large number of observations (or sample size). In this question, we would like to see how big the sample size should be in order to get the approximation error down to a certain level.

Assuming that we have a Binomial random variable \(X\sim Bin(n,0.8)\), what is the smallest sample size n such that

\[P(∣\bar{X_{n}}−0.8∣>0.1)<0.01\]

Answer 4

\(X\sim Bin(n,0.8)\) means that \(X = \sum_{i=1}^{n}X_{i}\) with \(X_{i} \sim Be(0.8)\) i.i.d., so the sample mean \(\bar{X_{n}}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\) satisfies \(\mathbb{E}[\bar{X_{n}}]=0.8\) and \(Var(\bar{X_{n}})=\frac{0.16}{n}\). Combining the weak law of large numbers with Chebyshev’s Inequality, we want n such that:

\[P(∣\bar{X_{n}}−\mu∣>\epsilon)\leq\frac{\sigma^{2}}{n\epsilon^{2}}\], with \(\mu=0.8, \epsilon=0.1, \sigma^{2}=0.16\)

Thus, we have:

\[P(∣\bar{X_{n}}−0.8∣>0.1)\leq\frac{0.16}{n(0.1)^{2}}\]

So, to make this bound equal to the required 0.01, we need:

\[\frac{0.16}{n(0.1)^{2}}=0.01\] Thus n should be:

\[\boxed{n= \frac{0.16}{0.01\times(0.1)^{2}}=1600}\]
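
Chebyshev’s bound is conservative. A minimal R sketch computing the exact binomial tail probability at n = 1600 confirms the requirement is met, with a lot of room to spare:

n <- 1600
# P(|X/n - 0.8| > 0.1) = P(X < 0.7n) + P(X > 0.9n) for X ~ Bin(n, 0.8)
pbinom(0.7 * n - 1, n, 0.8) + (1 - pbinom(0.9 * n, n, 0.8))  # far below 0.01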

Question 5

Please use the central limit theorem to prove that:

\[lim_{n \rightarrow +\infty}(1+n+\frac{n^{2}}{2!}+...+\frac{n^{n}}{n!})e^{-n}=\frac{1}{2}\] Where \(n \in \mathbb{Z}^{+}\), and \(e\) stands for the euler number.

Answer 5:

Let \(Y_{1}, ..., Y_{n}\) be i.i.d. Poisson(1) random variables and \(S_{n} = \sum_{i=1}^{n}Y_{i}\). Then \(S_{n} \sim Poisson(n)\), so that \(P(S_{n} \leq n) = \sum_{i=0}^{n}\frac{n^{i}}{i!}e^{-n} = (1+n+\frac{n^{2}}{2!}+...+\frac{n^{n}}{n!})e^{-n}\), which is exactly the expression in question. Since \(\mathbb{E}[S_{n}]=n\) and \(Var(S_{n})=n\), the central limit theorem gives \(P(S_{n} \leq n)=P(\frac{S_{n}-n}{\sqrt{n}} \leq 0) \rightarrow \Phi(0)=\frac{1}{2}\). So the limit is \(\frac{1}{2}\).
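
A minimal R sketch checking the convergence numerically (ppois(n, n) computes exactly the expression in question):

sapply(c(10, 100, 1000, 10000), function(n) ppois(n, n))  # tends to 0.5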

Part 3 Hypothesis Testing

Question 1

To estimate the difficulty of the final exam, the teaching team has designed a mock exam that is at the same level of difficulty as that of the real exam.

If the average mark for this mock exam is lower than 60, that would indicate that the level of difficulty is too challenging for all students. If the average marks is higher than 70, then the difficulty level is considered as low.

In order to find out the difficulty level of the mock exam, the teaching team has randomly invited a sample of 20 students to take the mock exam. The average mark of the sample is 66, with a sample standard deviation of 10.

Supposing that the score is normally distributed, what hypothesis (or hypotheses) should the teaching team raise if they want to make sure that the level of difficulty for the exam is normal (i.e. the difficulty is not considered high or low)? Using hypothesis testing methods, determine if they can prove the hypothesis?

Answer 1

i) Hypotheses

Two assumptions should be tested: whether the exam is too difficult and whether it is too easy. So two different tests should be conducted.

\(\alpha)\) Is the mock exam too difficult?

Let m be the mean of the students’ performances on this mock exam.

\[ \begin{cases} H_{0}: m \leq 60 \\ H_{1} : m \gt 60 \end{cases}\]

As the sample size, 20, is less than 30, the test statistic \(T_{n} = \frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}\) follows a Student t distribution with n - 1 degrees of freedom under the null hypothesis, where \(\bar{x}\) is the sample mean, s is the sample standard deviation, \(\mu_{0}=60\) and n is the sample size.

So, we have:

\(T_{n} \sim t_{n-1}\).

and \(T_{20}=\frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}=\frac{66-60}{10/\sqrt{20}}=2.6833\)

And we can read in the Student table: \[t_{1-\alpha, n -1}=t_{1-0.05, 19}=1.7291\]

Decision criteria for our test: We reject the null hypothesis, \(H_{0}\) if \(T_{n} \geq t_{1-\alpha, n -1}\)

In our case, we have \(T_{20} \gt t_{1-0.05, 19}\) because \(2.6833 \gt 1.7291\). So, we reject the null hypothesis: there is significant statistical evidence, based on our sample, that the mean mark is above 60, i.e., the exam is not too difficult.

\(\beta)\) Is the mock exam too easy?

Let m be the mean of the students’ performances on this mock exam.

\[ \begin{cases} H_{0}: m \geq 70 \\ H_{1} : m \lt 70 \end{cases}\]

As the sample size, 20, is less than 30, the test statistic \(T_{n} = \frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}\) follows a Student t distribution with n - 1 degrees of freedom under the null hypothesis, where \(\bar{x}\) is the sample mean, \(\mu_{0}=70\) and n is the sample size.

So, we have:

\(T_{n} \sim t_{n-1}\).

and \(T_{20}=\frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}=\frac{66-70}{10/\sqrt{20}}=-1.7889\)

And we can read in the student table: \[t_{\alpha, n -1}=t_{0.05, 19}=-1.7291328\]

Decision criteria for our test: We reject the null hypothesis, \(H_{0}\), if \(T_{n} \leq t_{\alpha, n -1}\)

In our case, we have \(T_{20} \lt t_{0.05, 19}\) because \(-1.7889 \lt -1.7291\). So, we reject the null hypothesis: there is significant statistical evidence, based on our sample, that the mean mark is below 70, i.e., the exam is not too easy. Combining the two tests, the teaching team can conclude, at the 5% significance level, that the difficulty of the exam is normal.
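
A minimal R sketch reproducing both one-sided tests from the summary statistics:

x.bar <- 66; s <- 10; n <- 20
(x.bar - 60) / (s / sqrt(n))  #  2.683  >  qt(0.95, 19) =  1.729 : reject m <= 60
(x.bar - 70) / (s / sqrt(n))  # -1.789  < -qt(0.95, 19) = -1.729 : reject m >= 70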

Question 2:

Yun has a crush on a girl who is a statistician. He wants to ask her out but is afraid of getting REJECTED. So before doing anything stupid, he would like to perform some simulations on how the date would go (either the date goes well or fails spectacularly) and base his final decision on the simulation results (yeah we know it is stupid too).

After running the simulation 10 times, he claims to you, his friend, that all the evidence pointed out that the date would go well and that he now feels confident in asking the girl out. However, as his friend, you don’t really want to let him make a fool of himself so you secretly check his model accuracy and theorize that the model has a 0.3 probability of making erroneous predictions based on what you have seen, i.e., the model says that the date would go well but then it does not go well in reality. Knowing this vital piece of information, do you think that he should still ask the girl out? (Hint: You can base your hypothesis on your estimate)

Answer 2

Let’s determine if Yun should ask the girl out:

Let \((X_{i})_{i \in \{1, ..., 10\}}=\begin{cases}1 \,& if \,the \,girl \,ACCEPTs \,Yun \\ 0 \,& if \,the \,girl \,REJECTs \,Yun \end{cases}\) be the random variables associated with the outcome of each of Yun’s simulations.

We have \(X_{i} \sim Be(p)\), with \(\hat{p}=1-0.3=0.7\).

Let’s build a test to check if there is enough statistical evidence about this probability.

Let’s assume that if \(p \lt 0.5\), Yun would get REJECTED.

So, the assumptions of our test are: \[ \begin{cases} H_{0}: p \leq 0.5 \\ H_{1} : p \gt 0.5 \end{cases}\]

Let’s determine a critical region for our test

\(S_{n}=\sum_{i=1}^{n}X_{i} \sim Bin(n, p)\), with n=10 and p is unknown

If \(np\gt5\) and \(n(1-p)\gt5\), we can approximate this binomial distribution with a normal distribution. As we don’t know the value of p and really want to help Yun, we will use this approximation anyway.

And then \(S_{n} \sim \mathcal{N}(np, \sqrt{np(1-p)})\) approximately, i.e., \(S_{n} \sim \mathcal{N}(m, \sigma)\), with \(m = 10p\) and \(\sigma=\sqrt{10p(1-p)}\)

The original test is now equivalent to:

\[ \begin{cases} H_{0}: m \leq 5 \\ H_{1} : m \gt 5 \end{cases}\] Both hypotheses are composite. Let’s first determine the critical region of the test between the two simple hypotheses:

\[ \begin{cases} H_{0}: m = m_{0} \\ H_{1} : m = m_{1} \end{cases}\] for any \(m_{0}\) and \(m_{1}\) satisfying \(m_{0} \leq 5\) and \(m_{1} \gt 5\). We write the likelihood as:

\[L(x_{1}, ..., x_{n}; m) = \frac{1}{\sigma\sqrt{2\pi}}exp\{-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_{i}-m)^{2}\}\] The form of the critical region given by the Neyman-Pearson theorem is \(\frac{L_{0}}{L_{1}}\leq k\), which, by taking logarithms, leads to the inequality:

\[\begin{align} \frac{L_{0}}{L_{1}}\leq k & \implies -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}[(x_{i}-m_{0})^{2}-(x_{i}-m_{1})^{2}] \leq \ln k\\ & \implies -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}[-2(m_{0}-m_{1})x_{i}] \leq \ln k', \; absorbing \; the \; terms \; free \; of \; x_{i} \; into \; k'\\ & \implies \frac{n}{\sigma^{2}}(m_{0}-m_{1})\bar{x_{n}} \leq \ln k', \; with \, \bar{x_{n}}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\\ & \implies \bar{x_{n}} \geq \frac{\sigma^{2}\ln k'}{n(m_{1}-m_{0})} \end{align}\]

This critical region does not depend on the particular values of \(m_{0}\) and \(m_{1}\), which always satisfy \(m_{1}-m_{0}\gt0\). So it is the critical region of the uniformly most powerful test for \(H_{0}: m \leq 5\) against \(H_{1}: m \gt 5\) and is defined by:

\[\bar{x_{n}}\geq C\] The value of the constant C is determined by:

\[\alpha = P(W\mid m=5) = P(\bar{X_{n}}\geq C\mid m=5)\] For m = 5, the empirical mean \(\bar{X_{n}}\sim \mathcal{N}(5, \sigma/\sqrt{n})\), so by standardizing, we get the condition:

\[\alpha = P(U\geq\frac{C-5}{\sigma/\sqrt{n}})\] where U is a r.v. of law \(\mathcal{N}(0, 1)\). Thus the constant C is defined by:

\[C=5+\frac{\sigma u}{\sqrt{n}}=5+u\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\] where u is the quantile of order \(1-\alpha\) of the \(\mathcal{N}(0, 1)\) law. For a risk level \(\alpha=0.05\) we read in the normal table the quantile u = 1.6449; hence the critical region for n=10 is defined by:

\[W=\{x_{1}, ... , x_{10}\mid \bar{x_{10}}\geq5+1.65\sqrt{\frac{0.7(1-0.7)}{10}}\}=\{x_{1}, ... , x_{10}\mid \bar{x_{10}}\geq 5.2391\}\] In our case, \(\bar{x_{10}}=10\times0.7=7\), which belongs to the critical region. There is enough statistical evidence in favour of \(H_{1}\), i.e., that the date would go well. So Yun should ask the girl out.
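
A minimal R sketch reproducing the numbers used above (the normal quantile and the critical threshold):

u <- qnorm(0.95)                   # 1.6449
5 + u * sqrt(0.7 * 0.3 / 10)       # critical threshold, about 5.239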

Question 3

An online retailer is testing the new design of their banner via A/B testing. When a user goes to the retailer’s website, he/she is randomly assigned either the old or the new banner (the probability that the user sees either banner is the same).

One of the metrics to measure users’ fondness of the banner design is by how many times the banner was “clicked” on by the users, the more clicks/interactions between the users and the banner, the more likely that the users are happy with the design.

For this reason, the retailer has collected the number of interactions between the users and the banners (both the old and the new design) for the past 14 days and presented the results as follows:

Old banner: (1145, 776, 1002, 986, 1108, 1088, 891, 1071, 1014, 1071, 2012, 1944, 2206, 2125)

New banner: (1332, 1390, 1000, 1184, 1240, 1400, 1026, 1281, 1314, 1067, 2332, 1972, 2385, 2299)

Knowing that there was a promotion which started from the 11th day, can you prove that the new banner design is better, and does it depend on whether a promotion is running? How do you interpret the results?

Answer 3

i) Is the new banner design better?

Let’s perform a test for the difference of the means of the number of clicks recorded for the new banner vs the old banner:

Our assumption is that there is no difference, then we have the hypotheses:

\[ \begin{cases} H_{0}: \mu_{new\_banner}-\mu_{old\_banner}=0 \\ H_{1} : \mu_{new\_banner}-\mu_{old\_banner}\neq0 \end{cases}\]

As the sample size, 14, is less than 30, the test statistic \(T_{n} = \frac{(\mu_{new\_banner}-\mu_{old\_banner})-0}{\sqrt{\frac{s_{new\_banner}^{2}}{n_{new\_banner}}+\frac{s_{old\_banner}^{2}}{n_{old\_banner}}}}\) follows a Student t distribution.

So, we have:

\(T_{n} \sim t_{\nu}\), with \(\nu = \frac{(\frac{s_{new\_banner}^{2}}{n_{new\_banner}}+\frac{s_{old\_banner}^{2}}{n_{old\_banner}})^{2}}{\frac{s_{new\_banner}^{4}}{n_{new\_banner}^{2}(n_{new\_banner}-1)}+\frac{s_{old\_banner}^{4}}{n_{old\_banner}^{2}(n_{old\_banner}-1)}}\)

Based on the sample data, we have:

\(\mu_{new\_banner} = 1515.857\), \(\mu_{old\_banner} = 1317.071\), \(s^{2}_{new\_banner} = 253508.9\), \(s^{2}_{old\_banner} = 256986.8\) and \(n_{1}=n_{2}=n=14\)

So \(T=\frac{(\mu_{new\_banner}-\mu_{old\_banner})-0}{\sqrt{\frac{s_{new\_banner}^{2}}{n_{new\_banner}}+\frac{s_{old\_banner}^{2}}{n_{old\_banner}}}}=\frac{(1515.857-1317.071)-0}{\sqrt{\frac{253508.9}{14}+\frac{256986.8}{14}}}=1.0410\)

and \(\nu = \frac{(\frac{s_{new\_banner}^{2}}{n_{new\_banner}}+\frac{s_{old\_banner}^{2}}{n_{old\_banner}})^{2}}{\frac{s_{new\_banner}^{4}}{n_{new\_banner}^{2}(n_{new\_banner}-1)}+\frac{s_{old\_banner}^{4}}{n_{old\_banner}^{2}(n_{old\_banner}-1)}}=\frac{(\frac{256986.8}{14}+\frac{253508.9}{14})^{2}}{\frac{256986.8^{2}}{14^{2}(14-1)}+\frac{253508.9^{2}}{14^{2}(14-1)}}=25.9987933\)

And we can read in the student table: \[t_{1-\alpha/2, \nu}=t_{1-0.025, 26}=2.0555294\]

Decision criteria for our test: We reject the null hypothesis, \(H_{0}\) if \(|T_{n}| \geq t_{1-\alpha/2, \nu}\)

In our case, we have \(|T| \lt t_{1-0.025, 26}\) because \(1.04 \lt 2.06\). So, we fail to reject the null hypothesis: there is not enough statistical evidence, based on our sample, that the average number of clicks with the new banner differs from the one obtained with the old banner.

ii) Does the number of clicks depend on whether the promotion is running or not?

Let’s perform a test for the difference of the means of the number of clicks recorded before the promotion and during it:

Our assumption is that there is no difference, then we have the hypotheses:

\[ \begin{cases} H_{0}: \mu_{Promo}-\mu_{NoPromo}=0 \\ H_{1} : \mu_{Promo}-\mu_{NoPromo}\neq0 \end{cases}\]

As the sample sizes are less than 30, the test statistic \(T_{n} = \frac{(\mu_{Promo}-\mu_{NoPromo})-0}{\sqrt{\frac{s_{Promo}^{2}}{n_{Promo}}+\frac{s_{NoPromo}^{2}}{n_{NoPromo}}}}\) follows a Student t distribution.

So, we have:

\(T_{n} \sim t_{\nu}\), with \(\nu = \frac{(\frac{s_{Promo}^{2}}{n_{Promo}}+\frac{s_{NoPromo}^{2}}{n_{NoPromo}})^{2}}{\frac{s_{Promo}^{4}}{n_{Promo}^{2}(n_{Promo}-1)}+\frac{s_{NoPromo}^{4}}{n_{NoPromo}^{2}(n_{NoPromo}-1)}}\)

Based on the sample data, we have:

\(\mu_{Promo} = 2159.375\), \(\mu_{NoPromo} = 1119.3\), \(s^{2}_{Promo} = 29538.84\), \(s^{2}_{NoPromo} = 27622.12\) and \(n_{Promo}=8\), \(n_{NoPromo}=20\) (the promotion runs over the last 4 of the 14 days, for both banners).

So \(T=\frac{(\mu_{Promo}-\mu_{NoPromo})-0}{\sqrt{\frac{s_{Promo}^{2}}{n_{Promo}}+\frac{s_{NoPromo}^{2}}{n_{NoPromo}}}}=\frac{(2159.375-1119.3)-0}{\sqrt{\frac{29538.84}{8}+\frac{27622.12}{20}}}=14.6020\)

and \(\nu = \frac{(\frac{s_{Promo}^{2}}{n_{Promo}}+\frac{s_{NoPromo}^{2}}{n_{NoPromo}})^{2}}{\frac{s_{Promo}^{4}}{n_{Promo}^{2}(n_{Promo}-1)}+\frac{s_{NoPromo}^{4}}{n_{NoPromo}^{2}(n_{NoPromo}-1)}}=\frac{(\frac{29538.84}{8}+\frac{27622.12}{20})^{2}}{\frac{29538.84^{2}}{8^{2}(8-1)}+\frac{27622.12^{2}}{20^{2}(20-1)}}=12.57\approx 13\)

And we can read in the Student table: \[t_{1-\alpha/2, \nu}=t_{1-0.025, 13}=2.1604\]

Decision criteria for our test: We reject the null hypothesis, \(H_{0}\) if \(|T_{n}| \geq t_{1-\alpha/2, \nu}\)

In our case, we have \(|T| \gt t_{1-0.025, 13}\) because \(14.60 \gt 2.16\). So, we reject the null hypothesis: there is strong statistical evidence, based on our sample, that the average number of clicks during the promotion differs from the one before the promotion. This confounds the measured effect of the banner change on the number of clicks. Another analysis, such as a two-way ANOVA, could be run to isolate the true effect of the banner.
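
A minimal R sketch running both Welch t-tests directly on the daily counts given in the question (the promotion covers days 11 to 14), using base R’s t.test:

old.banner <- c(1145, 776, 1002, 986, 1108, 1088, 891, 1071, 1014, 1071, 2012, 1944, 2206, 2125)
new.banner <- c(1332, 1390, 1000, 1184, 1240, 1400, 1026, 1281, 1314, 1067, 2332, 1972, 2385, 2299)

t.test(new.banner, old.banner)              # banner comparison: no significant difference

promo   <- c(old.banner[11:14], new.banner[11:14])  # days with the promotion running
nopromo <- c(old.banner[1:10],  new.banner[1:10])   # days without the promotion
t.test(promo, nopromo)                      # promotion effect: highly significant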

Part 4: Linear Regression.

After 1000 years, due to human activities, the earth’s ecology has been irreversibly destroyed; hence, the earthlings seek to emigrate to a new earth-like planet, Kepler 22b. Kepler 22b (also known as Kepler Target Object KOI-087.01) is an exoplanet that orbits in the habitable zone of the sun-like star Kepler 22. It is located in the constellation Cygnus, about 600 light years (180 pc) from Earth, and was discovered by NASA’s Kepler Space Telescope in December 2011.

In 3011, humans discovered that this planet has a composition similar to that of the earth, with the water on which all animals and plants depend. Later, scientists conducted many field studies and proposed that we can migrate to this planet, with the migration process starting as soon as 3030. Humanity has since built many cities on Kepler 22b and completed the migration process by 3100 using the technologically advanced ultra-high-speed space shuttle.

Suppose that you live in 4118, and are one of the many data scientists working on the Kepler 22b climate problem. After many expansions made on the planet, mankind wants to prevent Kepler 22b from repeating the same mistakes as the earth; thus, a tremendous amount of resources has been put towards the study of the sulfur dioxide (SO2) level on Kepler 22b, because high SO2 levels cause acid rain.

You have been provided with three datasets, Regression_train.csv, Regression_test.csv, and Regression_new.csv. Using these datasets, you hope to build a model that can predict the SO2 level on this new planet using other variables. Regression_train.csv and Regression_new.csv come with the ground-truth target label SO2 whereas Regression_test.csv comes with independent variables (input information) only.

The information of the attributes for these datasets can be found below:

year: year of data in this row
month: month of data in this row
day: day of data in this row
hour: hour of data in this row
PM2.5: PM2.5 concentration (ug/m^3)
PM10: PM10 concentration (ug/m^3)
NO2: NO2 concentration (ug/m^3)
CO: CO concentration (ug/m^3)
TEMP: temperature (degree Celsius)
O3: O3 concentration (ug/m^3)
PRES: pressure (hPa)
DEWP: dew point temperature (degree Celsius)
RAIN: precipitation (mm)
wd: wind direction
WSPM: wind speed (m/s)
station: name of the air-quality monitoring site
SO2: SO2 concentration (ug/m^3)

PLEASE NOTE THAT THE USE OF LIBRARIES IS PROHIBITED IN THESE QUESTIONS UNLESS STATED OTHERWISE; ANSWERS USING LIBRARIES WILL RECEIVE 0 MARKS

Question 1

Please load the Regression_train.csv and fit a multiple linear regression model with SO2 being the target variable. According to the summary table, which predictors do you think are possibly associated with the target variable (use the significance level of 0.01), and which are the Top 10 strongest predictors? Please write an R script to automatically fetch and print this information.

NOTE: Manually doing the above tasks will result in 0 marks.

Answer 1

# ANSWER BLOCK
# Read in the data here
train <- read.table("../Data/Regression_train.csv", sep=",", header=TRUE)


# Build the multiple linear regression model here
lm.fit <- lm(SO2 ~ ., data=train)

# Get the summary of the model here
fit.summary <- summary(lm.fit) 

# Write the function to get the important predictors as well as the top 10 strongest predictors:
top.predictors <- function(fit.summary) {
    # P-values are the 4th column of the coefficient matrix
    p.values <- fit.summary$coefficients[, 4]

    # Getting the important predictors (significant at the 0.01 level)
    coef.imp_with_intercept <- p.values[p.values < 0.01]

    # Dropping the intercept
    coef.imp <- coef.imp_with_intercept[2:length(coef.imp_with_intercept)]

    # Getting the top 10 predictors (smallest p-values)
    coef.most <- sort(coef.imp)[1:10]

    # Printing out the results, you can keep this format or make some
    # format that looks better
    print(paste("The important features are: ", names(coef.imp)))
    print(paste("The top 10 most important features are: ", names(coef.most)))
}

top.predictors(fit.summary)

Question 2

Rather than calling the lm() function, you would like to write your own function to do the least square estimation for the simple linear regression model parameters \(β_{0}\) and \(β_{1}\). The function takes two input arguments with the first being the dataset name and the second the predictor name, and outputs the fitted linear model with the form: \(E[SO2]=β_{0}+β_{1}x\)

Code up this function in R and apply it to the two predictors TEMP and CO separately, and explain the effect that those two variables have on SO2 .

Answer 2

# ANSWER BLOCK
# Least squared estimator function
lsq <- function(dataset, predictor){
  # Least squares estimates for the model E[SO2] = beta_0 + beta_1 * x
  x <- dataset[[predictor]]
  beta_1 <- cov(x, dataset$SO2)/var(x)
  beta_0 <- mean(dataset$SO2) - beta_1*mean(x)
  # Return the fitted model as a string:
  return(paste0('E[SO2]=', beta_0,'+', beta_1,'*',predictor))
}
print(lsq(train, 'TEMP'))
## [1] "E[SO2]=23.3685621283906+-0.585366593397881*TEMP"
print(lsq(train, 'CO'))
## [1] "E[SO2]=3.65554016974754+0.00967776157015756*CO"

Question 3

R squared from the summary table reflects that the full model doesn’t fit the training dataset well; thus, you try to quantify the error between the values of the ground-truth and those of the model prediction. You want to write a function to predict \(SO_{2}\) with the given dataset and calculate the root mean squared error (rMSE) between the model predictions and the ground truths. Please test this function on the full model and the training dataset.

Answer 3

# ANSWER BLOCK
rmse <- function(model, dataset){
  residual <- na.omit(dataset$SO2) - predict.lm(model, newdata = dataset, na.action=na.omit)
  
  return (sqrt(mean(residual^2, na.rm=TRUE)))
}

print(rmse(lm.fit, train))
## [1] 14.99333

Question 4

You find the full model complicated and try to reduce the complexity by performing bidirectional stepwise regression with BIC. Please briefly explain what forward stepwise regression does in less than 100 words.

Calculate the rMSE of this new model with the function that you implemented previously.

Answer 4
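
Forward stepwise regression starts from the null model (intercept only) and, at each step, adds the single predictor whose inclusion most improves the chosen criterion (here BIC); it stops when no remaining predictor improves the criterion further. Bidirectional stepwise additionally reconsiders, at each step, whether a previously added variable should be removed.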

# ANSWER BLOCK
sw.fit <- step(lm.fit, direction = "both", k = log(nrow(train))) # k = log(n) is the BIC penalty
## Start:  AIC=1137455
## SO2 ~ year + month + day + hour + PM2.5 + PM10 + NO2 + CO + O3 + 
##     TEMP + PRES + DEWP + RAIN + wd + WSPM + station
## 
##           Df Sum of Sq      RSS     AIC
## - day      1         0 47215402 1137453
## - PM10     1         0 47215403 1137453
## <none>                 47215402 1137455
## - RAIN     1      4455 47219858 1137473
## - hour     1     10811 47226214 1137501
## - WSPM     1     16287 47231690 1137525
## - PRES     1     30098 47245501 1137587
## - TEMP     1     63684 47279087 1137736
## - PM2.5    1    202831 47418234 1138353
## - station 10    319887 47535289 1138853
## - O3       1    332371 47547774 1138926
## - wd      16    592188 47807590 1140041
## - DEWP     1   1552048 48767451 1144246
## - NO2      1   1629436 48844838 1144579
## - CO       1   2242911 49458313 1147201
## - month    1   2861303 50076705 1149810
## - year     1   4910741 52126144 1158235
## 
## Step:  AIC=1137453
## SO2 ~ year + month + hour + PM2.5 + PM10 + NO2 + CO + O3 + TEMP + 
##     PRES + DEWP + RAIN + wd + WSPM + station
## 
##           Df Sum of Sq      RSS     AIC
## - PM10     1         0 47215403 1137451
## <none>                 47215402 1137453
## - RAIN     1      4455 47219858 1137471
## - hour     1     10813 47226216 1137499
## - WSPM     1     16287 47231690 1137523
## - PRES     1     30289 47245692 1137586
## - TEMP     1     63692 47279095 1137734
## - PM2.5    1    203002 47418404 1138352
## - station 10    319992 47535395 1138852
## - O3       1    332392 47547795 1138924
## - wd      16    592206 47807609 1140039
## - DEWP     1   1554995 48770398 1144257
## - NO2      1   1630093 48845496 1144580
## - CO       1   2246556 49461959 1147214
## - month    1   2861528 50076930 1149809
## - year     1   4911048 52126451 1158234
## 
## Step:  AIC=1137451
## SO2 ~ year + month + hour + PM2.5 + NO2 + CO + O3 + TEMP + PRES + 
##     DEWP + RAIN + wd + WSPM + station
## 
##           Df Sum of Sq      RSS     AIC
## <none>                 47215403 1137451
## - RAIN     1      4455 47219858 1137469
## - hour     1     10813 47226216 1137497
## - WSPM     1     16484 47231886 1137522
## - PRES     1     30658 47246060 1137585
## - TEMP     1     63967 47279369 1137733
## - station 10    320373 47535776 1138851
## - O3       1    332696 47548098 1138924
## - PM2.5    1    408944 47624347 1139260
## - wd      16    592270 47807673 1140037
## - DEWP     1   1584140 48799543 1144380
## - NO2      1   1713576 48928979 1144937
## - CO       1   2247383 49462786 1147216
## - month    1   2861534 50076936 1149807
## - year     1   4912037 52127439 1158236
summary(sw.fit)
## 
## Call:
## lm(formula = SO2 ~ year + month + hour + PM2.5 + NO2 + CO + O3 + 
##     TEMP + PRES + DEWP + RAIN + wd + WSPM + station, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94.83  -7.63  -1.40   5.39 487.05 
## 
## Coefficients:
##                      Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)         1.752e+04  1.185e+02  147.860  < 2e-16 ***
## year               -4.384e+00  2.966e-02 -147.806  < 2e-16 ***
## month              -1.229e+00  1.089e-02 -112.813  < 2e-16 ***
## hour               -3.579e-02  5.160e-03   -6.935 4.08e-12 ***
## PM2.5               3.253e-02  7.628e-04   42.647  < 2e-16 ***
## NO2                 1.526e-01  1.747e-03   87.300  < 2e-16 ***
## CO                  5.316e-03  5.317e-05   99.977  < 2e-16 ***
## O3                  3.617e-02  9.404e-04   38.467  < 2e-16 ***
## TEMP                1.424e-01  8.441e-03   16.867  < 2e-16 ***
## PRES                7.513e-02  6.434e-03   11.677  < 2e-16 ***
## DEWP               -5.073e-01  6.044e-03  -83.938  < 2e-16 ***
## RAIN                1.840e-01  4.134e-02    4.451 8.54e-06 ***
## wdE                 2.752e+00  5.488e-01    5.015 5.31e-07 ***
## wdENE               2.494e+00  5.471e-01    4.559 5.15e-06 ***
## wdESE               3.807e+00  5.523e-01    6.894 5.44e-12 ***
## wdN                 1.745e-02  5.477e-01    0.032  0.97458    
## wdNE                2.725e+00  5.440e-01    5.009 5.47e-07 ***
## wdNNE               1.797e+00  5.487e-01    3.275  0.00106 ** 
## wdNNW               6.830e-02  5.520e-01    0.124  0.90151    
## wdNW                3.454e-01  5.491e-01    0.629  0.52932    
## wdS                 5.543e+00  5.571e-01    9.950  < 2e-16 ***
## wdSE                5.010e+00  5.547e-01    9.033  < 2e-16 ***
## wdSSE               5.355e+00  5.585e-01    9.588  < 2e-16 ***
## wdSSW               5.733e+00  5.555e-01   10.321  < 2e-16 ***
## wdSW                4.148e+00  5.517e-01    7.518 5.60e-14 ***
## wdW                 2.198e+00  5.605e-01    3.921 8.81e-05 ***
## wdWNW               4.779e-01  5.547e-01    0.862  0.38893    
## wdWSW               3.294e+00  5.580e-01    5.903 3.58e-09 ***
## WSPM               -2.975e-01  3.475e-02   -8.562  < 2e-16 ***
## stationBarkville    2.864e+00  1.541e-01   18.585  < 2e-16 ***
## stationBeninsula    4.268e+00  1.618e-01   26.377  < 2e-16 ***
## stationGaulfield    3.718e+00  1.561e-01   23.817  < 2e-16 ***
## stationGlayton      2.708e+00  1.527e-01   17.738  < 2e-16 ***
## stationKippsland    1.490e+00  1.558e-01    9.564  < 2e-16 ***
## stationMerribee     6.819e-01  1.547e-01    4.407 1.05e-05 ***
## stationPerwick      3.122e+00  1.510e-01   20.679  < 2e-16 ***
## stationSuchou       3.136e+00  1.518e-01   20.655  < 2e-16 ***
## stationWalaysia     3.641e+00  1.611e-01   22.596  < 2e-16 ***
## stationWurrumbeena  2.961e+00  1.563e-01   18.951  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.99 on 209994 degrees of freedom
## Multiple R-squared:  0.4944, Adjusted R-squared:  0.4943 
## F-statistic:  5403 on 38 and 209994 DF,  p-value: < 2.2e-16
print(rmse(sw.fit, train))
## [1] 14.99333

Question 5

You have been given a new dataset Regression_new.csv that has the latest data of Kepler 22b. You are going to apply the new model sw.fit on the new dataset to evaluate the model performance with using rMSE. When you look into rMSE, what do you find? If you think sw.fit works well on Regression_new.csv, please explain why it does well. Otherwise, if you think your model sw.fit doesn’t perform well on Regression_new.csv, can you point out the potential reason(s) for this?

Answer 5

# ANSWER BLOCK
new <- read.table("../Data/Regression_new.csv", sep=",", header=TRUE) # Reading in the new dataset

print(rmse(sw.fit, new)) # Finding out the rMSE of the sw.fit model with respect to the new dataset
## [1] 437.9666

The model sw.fit doesn’t perform well on the new data: the rMSE jumps from about 15 on the training set to about 438. This points to overfitting, the model is too closely fitted to the training set and/or contains too many features, and likely also to a shift between the training data and the new data: Regression_new.csv holds the latest data, so the fitted linear trend in year must extrapolate beyond the range it was trained on.

Question 6

Although stepwise regression has reduced the model complexity significantly, the model still contains a lot of variables that we want to remove. Therefore, you are interested in lightweight linear regression models with ONLY TWO predictors. Write a script to automatically find the best lightweight model which corresponds to the model with the least rMSE on the training dataset (Not the new dataset). Compare the rMSE of the best lightweight model with the rMSE of the full model - lm.fit - that you built previously. Give an explanation for these results based on consideration of the predictors involved.

Answer 6

# ANSWER BLOCK
# Some variables that you would want to initialize
minimum_error = 0
features = combn(names(train)[names(train)!="SO2"], 2)

send_features <- function(index){
  feature1 = features[1, index]
  feature2 = features[2, index]
  return (c(feature1, feature2))
}

train_model <- function(index){
  f1 <- send_features(index)[1]
  f2 <- send_features(index)[2]
  lm.fit <- lm(SO2 ~ . , data = train[c(f1, f2, "SO2")])
  
  return (lm.fit)
}

send_error <- function(index){
  return (rmse(train_model(index), train))
}

errors <- sapply(seq_len(ncol(features)), send_error)  # one rMSE per predictor pair

features_ <- features[, which.min(errors)]

minimum_error <- errors[which.min(errors)]
# CODE HERE
print(paste('The best features are', features_, '; and the rMSE is', minimum_error))
## [1] "The best features are month ; and the rMSE is 17.0783268544614"
## [2] "The best features are CO ; and the rMSE is 17.0783268544614"

The rMSE of the lightweight model is a little greater than that of the previous models. This is due to the fact that there are fewer variables in the lightweight model than in the previous ones: it is a kind of information loss.

Question 7

Rather than looking into rMSE, you want to build a lightweight linear regression model with ONLY TWO predictors which has the highest R squared. Write a script to automatically find the best lightweight model which corresponds to the model with the highest R squared on the training dataset (Not the new dataset).

Furthermore, please compare the two predictors in the best lightweight model found in the previous question and those of this question. Are the two predictors in each case different? If they do differ, please explain why?

Answer 7

# ANSWER BLOCK
# Some variables that you would want to initialize
maximum_rsquared = 0
features = combn(names(train)[names(train)!="SO2"], 2)
# CODE HERE

send_r_squared <- function(index){
  return (summary(train_model(index))$r.squared)
}

r_squareds <- sapply(seq_len(ncol(features)), send_r_squared)

which.max(r_squareds)
## [1] 21
maximum_rsquared <- r_squareds[which.max(r_squareds)]

print(paste('The best features are', features[,which.max(r_squareds)], '; and the rSquared is', maximum_rsquared))
## [1] "The best features are month ; and the rSquared is 0.343958840257592"
## [2] "The best features are CO ; and the rSquared is 0.343958840257592"

No, the predictors aren’t different: both criteria select month and CO. This is expected, because for least squares models fitted on the same training data, minimizing the training rMSE is equivalent to maximizing \(R^{2}\): since \(R^{2} = 1 - \frac{SSE}{SST}\) and SST is fixed by the data, the model with the smallest error automatically has the highest \(R^{2}\).

Question 8 (Libraries are allowed)

As a Data Scientist, one of the key tasks is to build models most appropriate/closest to the truth; thus, modelling will not be limited to the aforementioned steps in this assignment. Thus, you will be graded by the rMSE performance of your model, the better your model, the higher your score. Additionally, you need to describe/document your thought process in this model building process, this is akin to showing your working properly for the mathematic sections. If you don’t clearly document the reasonings behind the model you use, we will have to make some deductions on your scores.

When you optimize your model’s performance, you can use any supervised model that you know and feature selection might be a big help as well.

Note Please make sure that we can install the libraries that you use in this part, the code structure can be:

install.packages(“some package”, repos=‘http://cran.us.r-project.org’)

library(“some package”)

Remember that if we cannot run your code, we will have to give you a deduction. Our suggestion is for you to use the standard R version 3.6.1

You also need to name your final model fin.mod so we can run a check to find out your performance. A good test for your understanding would be to set the previous BIC model to be the final model to check if your code works perfectly.

Answer 8

#Libraries

packages_list <- c("knitr", "ggplot2", "plyr",
                   "dplyr", "corrplot", "caret",
                   "gridExtra", "scales", "Rmisc", "ggrepel",
                   "randomForest", "psych", "xgboost", "HelpersMG", "glmnet",
                   "CatEncoders", "dplyr", "tibble")

i_p <- installed.packages()

length(i_p)
## [1] 20976
new.packages <- packages_list[!(packages_list %in% i_p[,"Package"])]

length(new.packages)
## [1] 1
if(length(new.packages)) install.packages(new.packages)
## Installing package into '/home/capitaindata/R/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
## For catboost package, you need to download the released version 1.0 on github
## and install it manually or run the below lines:
library(HelpersMG)
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: coda
## Loading required package: MASS
new.package <- c("catboost")[!(c("catboost") %in% i_p[,"Package"])]

if(length(new.package)){
  wget("https://github.com/catboost/catboost/releases/download/v1.0.0/catboost-R-Linux-1.0.0.tgz")
  install.packages("catboost-R-Linux-1.0.0.tgz", repos = NULL, type = .Platform$pkgType)
} 

library(knitr)
library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## Loading required package: lattice
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(scales)
library(Rmisc)
library(ggrepel)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
## 
##     outlier
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:HelpersMG':
## 
##     logit
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(catboost)
library(glmnet)
## Loaded glmnet 4.1
library(CatEncoders)
## 
## Attaching package: 'CatEncoders'
## The following object is masked from 'package:base':
## 
##     transform
library(tibble)

# Model selection: catboost, xgboost, lasso, ridge, and elastic net
# We use the original (unpreprocessed) data here because it is less resource-intensive
#define response variable
y <- train$SO2

#define matrix of predictor variables
x <- data.matrix(train[, c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2', 'CO',
                           'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station')])

# To mitigate possible multicollinearity, drop the CO predictor
x2 <- data.matrix(train[, c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2',
                           'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station')])

new_y <- new$SO2

#define matrix of predictor variables
new_x <- data.matrix(new[, c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2', 'CO',
                           'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station')])

new_x2 <- data.matrix(new[, c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2',
                             'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station')])

## Lasso
#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(x, y, alpha = 1)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
## [1] 0.01257658
#produce plot of test MSE by lambda value
plot(cv_model) 

#find coefficients of best model
lasso_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)

#Assess errors
RMSE(predict(lasso_model, s = best_lambda, newx = new_x), new_y)
## [1] 437.0371
RMSE(predict(lasso_model, s = best_lambda, newx = x), y)
## [1] 15.12651
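
A side benefit of the lasso penalty is built-in feature selection. As a quick sketch (using the lasso_model just fitted), the coefficients shrunk exactly to zero at best_lambda mark the predictors lasso drops:

#Sketch: exact zeros in the coefficient vector are the dropped predictors
coef(lasso_model)
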
## Ridge
#perform k-fold cross-validation to find optimal lambda value
ridge_cv_model <- cv.glmnet(x, y, alpha = 0)

#find optimal lambda value that minimizes test MSE
ridge_best_lambda <- ridge_cv_model$lambda.min
ridge_best_lambda
## [1] 1.119586
#produce plot of test MSE by lambda value
plot(ridge_cv_model) 

#find coefficients of best model
ridge_model <- glmnet(x, y, alpha = 0, lambda = ridge_best_lambda)

#Assess errors
RMSE(predict(ridge_model, s = ridge_best_lambda, newx = new_x), new_y)
## [1] 412.6456
RMSE(predict(ridge_model, s = ridge_best_lambda, newx = x), y)
## [1] 15.15738
## Elasticnet
#perform k-fold cross-validation to find optimal lambda value
elasticnet_cv_model <- cv.glmnet(x, y, alpha = 0.5)

#find optimal lambda value that minimizes test MSE
elasticnet_best_lambda <- elasticnet_cv_model$lambda.min
elasticnet_best_lambda
## [1] 0.02291863
#produce plot of test MSE by lambda value
plot(elasticnet_cv_model) 

#find coefficients of best model
elasticnet_model <- glmnet(x, y, alpha = 0.5, lambda = elasticnet_best_lambda)

#Assess errors
RMSE(predict(elasticnet_model, s = elasticnet_best_lambda, newx = new_x), new_y)
## [1] 436.894
RMSE(predict(elasticnet_model, s = elasticnet_best_lambda, newx = x), y)
## [1] 15.12653
cat("
    So far Ridge (412.64 rms) regression appears to be the one better performing on new data
    over elasticet(436) and lasso(437)
    ")
## 
##     So far Ridge (412.64 rms) regression appears to be the one better performing on new data
##     over elasticet(436) and lasso(437)
## 
## Ridge 2
#perform k-fold cross-validation to find optimal lambda value
ridge_cv_model <- cv.glmnet(x2, y, alpha = 0)

#find optimal lambda value that minimizes test MSE
ridge_best_lambda <- ridge_cv_model$lambda.min
ridge_best_lambda
## [1] 1.058009
#produce plot of test MSE by lambda value
plot(ridge_cv_model) 

#find coefficients of best model
ridge_model <- glmnet(x2, y, alpha = 0, lambda = ridge_best_lambda)

#Assess errors
RMSE(predict(ridge_model, s = ridge_best_lambda, newx = new_x2), new_y)
## [1] 399.2579
RMSE(predict(ridge_model, s = ridge_best_lambda, newx = x2), y)
## [1] 15.51221
cat("
    The error on new dataset is lower. It went to 399
    ")
## 
##     The error on new dataset is lower. It went to 399
## 
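
Note that each cv.glmnet call above re-draws its own cross-validation folds, so the three penalties were tuned on different splits. A minimal sketch (assuming the x, y, new_x, new_y objects defined above; RMSE is from caret) that compares them on identical folds via the foldid argument:

#Sketch: compare the three glmnet penalties on a fixed fold assignment
set.seed(55)
foldid <- sample(rep(1:10, length.out = nrow(x)))
alphas <- c(lasso = 1, ridge = 0, elasticnet = 0.5)
val_rmse <- sapply(alphas, function(a) {
  cv  <- cv.glmnet(x, y, alpha = a, foldid = foldid)
  fit <- glmnet(x, y, alpha = a, lambda = cv$lambda.min)
  RMSE(predict(fit, s = cv$lambda.min, newx = new_x), new_y)
})
val_rmse  # validation RMSE per penalty, now comparable fold-for-fold
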
# xgboost
dtrain <- xgb.DMatrix(data = x, label = y)
dtest <- xgb.DMatrix(data = new_x, label = new_y)

watchlist <- list(train=dtrain, test=dtest)

bstDMatrix <- xgb.train(data = dtrain, max.depth = 2, eta = 1,
                      watchlist=watchlist,
                      eval.metric = "rmse",
                      nthread = 2, nrounds = 2, 
                      objective = "reg:squarederror"
                      )
## [1]  train-rmse:16.589855    test-rmse:18.484869 
## [2]  train-rmse:15.197403    test-rmse:19.507105
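
With nrounds = 2 the xgboost run above is deliberately cheap, so its test RMSE would likely improve with more rounds. A sketch of a longer run (hyperparameters illustrative, not tuned) that lets early stopping choose the number of rounds:

#Sketch: longer xgboost run with early stopping (hyperparameters illustrative)
bst_tuned <- xgb.train(data = dtrain, watchlist = watchlist,
                       objective = "reg:squarederror", eval_metric = "rmse",
                       max_depth = 6, eta = 0.1, nthread = 2,
                       nrounds = 500, early_stopping_rounds = 10, verbose = 0)
bst_tuned$best_iteration  # round selected by early stopping on the test set
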
# catboost
# detach the targets before row-binding the train and validation frames for encoding
target <- train$SO2
train$SO2 <- NULL
val_target <- new$SO2
new$SO2 <- NULL

all_data <- bind_rows(train, new)
str(all_data)
## 'data.frame':    242913 obs. of  16 variables:
##  $ year   : int  4013 4013 4013 4013 4013 4013 4013 4013 4013 4013 ...
##  $ month  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour   : int  2 3 4 6 8 9 11 12 14 16 ...
##  $ PM2.5  : num  7 6 3 3 3 3 3 3 6 9 ...
##  $ PM10   : num  7 6 3 3 6 8 6 6 9 19 ...
##  $ NO2    : num  10 11 12 32 43 28 14 13 11 13 ...
##  $ CO     : int  300 300 300 500 500 400 400 300 400 400 ...
##  $ O3     : num  73 72 72 50 45 59 71 74 77 76 ...
##  $ TEMP   : num  -1.1 -1.4 -2 -2.6 0.1 1.2 2.9 3.9 6 5.9 ...
##  $ PRES   : num  1024 1024 1025 1026 1028 ...
##  $ DEWP   : num  -18.2 -19.4 -19.5 -19.1 -19.2 -19.3 -20.5 -19.7 -19.6 -18.1 ...
##  $ RAIN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ wd     : chr  "NNW" "NW" "N" "NNE" ...
##  $ WSPM   : num  5.6 3.1 2 2.5 4.1 2.6 3.7 5.1 4.4 3.9 ...
##  $ station: chr  "Glayton" "Glayton" "Glayton" "Glayton" ...
# define categorical variables and encode
cat_clean <- c('wd', 'station')
cat_data <- all_data[cat_clean]
lenc <- sapply(all_data[cat_clean], function(x) LabelEncoder.fit(x))
for (i in cat_clean){
  cat_data[[i]] <- transform(lenc[[i]], all_data[[i]])
}
cat_data <- cbind(cat_data, all_data[c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2', 'CO',
                                       'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM')])

cat_data2 <-cat_data[c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2', 
                                       'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM')]
str(cat_data)
## 'data.frame':    242913 obs. of  16 variables:
##  $ wd     : int  8 9 5 7 8 5 5 8 9 8 ...
##  $ station: int  5 5 5 5 5 5 5 5 5 5 ...
##  $ year   : int  4013 4013 4013 4013 4013 4013 4013 4013 4013 4013 ...
##  $ month  : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ day    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour   : int  2 3 4 6 8 9 11 12 14 16 ...
##  $ PM2.5  : num  7 6 3 3 3 3 3 3 6 9 ...
##  $ PM10   : num  7 6 3 3 6 8 6 6 9 19 ...
##  $ NO2    : num  10 11 12 32 43 28 14 13 11 13 ...
##  $ CO     : int  300 300 300 500 500 400 400 300 400 400 ...
##  $ O3     : num  73 72 72 50 45 59 71 74 77 76 ...
##  $ TEMP   : num  -1.1 -1.4 -2 -2.6 0.1 1.2 2.9 3.9 6 5.9 ...
##  $ PRES   : num  1024 1024 1025 1026 1028 ...
##  $ DEWP   : num  -18.2 -19.4 -19.5 -19.1 -19.2 -19.3 -20.5 -19.7 -19.6 -18.1 ...
##  $ RAIN   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ WSPM   : num  5.6 3.1 2 2.5 4.1 2.6 3.7 5.1 4.4 3.9 ...
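
As a sanity check (a sketch; inverse.transform is the CatEncoders counterpart of transform), the integer codes can be mapped back to the original labels:

#Sketch: verify the label encoding round-trips to the original categories
head(inverse.transform(lenc[['wd']], cat_data$wd))
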
# create train/validation pools from the encoded data
# (note: `train` is overwritten below, so capture its row count first)
n_train <- nrow(train)
train <- cat_data[1:n_train,]
val <- cat_data[(n_train+1):nrow(cat_data),]

train_pool <- catboost.load_pool(data = train, label = target)
val_pool <- catboost.load_pool(data = val, label = val_target)

train2 <- cat_data2[1:n_train,]
val2 <- cat_data2[(n_train+1):nrow(cat_data),]

train_pool2 <- catboost.load_pool(data = train2, label = target)
val_pool2 <- catboost.load_pool(data = val2, label = val_target)

# build model
params <- list(iterations=500,
               learning_rate=0.01,
               depth=5,
               loss_function='RMSE',
               eval_metric='RMSE',
               random_seed = 55,
               od_type='Iter',       # iteration-based overfitting detector
               metric_period = 50,   # report metrics every 50 iterations
               od_wait=2,            # stop after 2 iterations without improvement
               use_best_model=TRUE)  # keep the best iteration, not the last one

catboost_model <- catboost.train(train_pool, val_pool, params)
## Warning: Overfitting detector is active, thus evaluation metric is calculated on every iteration. 'metric_period' is ignored for evaluation metric.
## 0:   learn: 20.9724728   test: 23.7247771    best: 23.7247771 (0)    total: 108ms    remaining: 53.7s
## 50:  learn: 16.7947907   test: 21.0546799    best: 21.0546799 (50)   total: 1.41s    remaining: 12.4s
## 100: learn: 14.6333968   test: 19.7359643    best: 19.7359643 (100)  total: 2.74s    remaining: 10.8s
## 150: learn: 13.4292961   test: 18.9754100    best: 18.9754100 (150)  total: 4.08s    remaining: 9.44s
## 200: learn: 12.7053170   test: 18.3875044    best: 18.3875044 (200)  total: 5.38s    remaining: 8s
## Stopped by overfitting detector  (2 iterations wait)
## 
## bestTest = 18.09005565
## bestIteration = 237
## 
## Shrink model to first 238 iterations.
catboost_model2 <- catboost.train(train_pool2, val_pool2, params)
## Warning: Overfitting detector is active, thus evaluation metric is calculated on every iteration. 'metric_period' is ignored for evaluation metric.
## 0:   learn: 20.9746267   test: 23.7248877    best: 23.7248877 (0)    total: 51.4ms   remaining: 25.7s
## 50:  learn: 16.9948831   test: 21.1290432    best: 21.1290432 (50)   total: 1.36s    remaining: 11.9s
## 100: learn: 14.8773268   test: 19.8314519    best: 19.8314519 (100)  total: 2.68s    remaining: 10.6s
## 150: learn: 13.7151416   test: 19.1134485    best: 19.1134485 (150)  total: 4.12s    remaining: 9.52s
## Stopped by overfitting detector  (2 iterations wait)
## 
## bestTest = 18.59473682
## bestIteration = 197
## 
## Shrink model to first 198 iterations.
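
As a sanity check (a minimal sketch using the pools and targets defined above), the best validation score reported by catboost can be recomputed directly from predictions:

#Sketch: recompute the validation RMSE of the full-feature catboost model
val_pred <- catboost.predict(catboost_model, val_pool)
RMSE(val_pred, val_target)  # should be close to the reported bestTest value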

Ranked by increasing validation RMSE, the models are: catboost (18.09), xgboost (18.48), ridge (399.26), elastic net (436.89), and lasso (437.04). The catboost model trained on the full feature set is therefore chosen as the final model.

# Build your final model here, use additional coding blocks if you need to
fin.mod <- catboost_model
# Load in the test data.
test <- read.csv("../Data/Regression_test.csv")
# If you are using any packages that perform the prediction differently, please change this line of code accordingly.

# encode categorical variables with the encoders fitted on the training data,
# so the test set gets the same integer codes the model saw during training
cat_clean <- c('wd', 'station')
test_cat_data <- test[cat_clean]
for (i in cat_clean){
  test_cat_data[[i]] <- transform(lenc[[i]], test[[i]])
}
test_cat_data <- cbind(test_cat_data, test[c('year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'NO2', 'CO',
                                             'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM')])

test_pool <- test_cat_data %>% catboost.load_pool()

pred.label <- catboost.predict(fin.mod, test_pool)
# put these predicted labels in a csv file that you can use to commit to the Kaggle Leaderboard
write.csv(data.frame("RowIndex" = seq(1, length(pred.label)), "Prediction" = pred.label),  
          "RegressionPredictLabel.csv", row.names = F)
## PLEASE DO NOT ALTER THIS CODE BLOCK
## Please skip (don't run) this if you are a student
## For teaching team use only
# source("../supplimentary.R")
# truths <- read.csv("../Regression_truths.csv")
# RMSE.fin <- rmse(pred.label, truths$SO2)
# cat(paste("RMSE is", RMSE.fin))