Module 3, Week 1, gretl Problem Set 5

 

 

This week your gretl assignment is automated too.  This document gives you all the information you need to finish it.  Once you have completed this gretl assignment, the Paper and Pencil assignment should be easy! 

 

You can use the following commands (between the lines below) to bring the data required to complete this assignment into gretl and create a small sample of it for the Paper & Pencil assignment for Week 5.  This process eliminates observations with erroneous values.  I was only able to get all variables for each observation in this way.  Of course, if you know a different way to do this that’s fine too.  This is just so you don’t have to think too much about importing and exporting data for your assignments right now! 

 

I am pasting my entire script at the end of this document so you will have all commands in one place!  That is so you do not need to try to piece a script together.  You only need to read through the text of this document so you can answer the appropriate questions. 

 

Note: for this to work, you need to replace my paths with the complete path to the folder or directory you open data files from and save your gretl data files, output, etc. to.  The paths shown read the data file from my Midterm Exam folder and save the newly processed data files to Module 3\Week 5\...

 

#Open the entire corrected Boston housing dataset

open "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Midterm\correctHousing.gdt"

 

#Sample and store the records before those from Boston itself

smpl 1 356

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\first.gdt"

 

#Restore the full dataset to get the remaining records

smpl --full

#Sample and store the remaining records

smpl 489 506

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\second.gdt"

 

#Open the file with the first set of records

open "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\first.gdt"

#Append the records in the second file to those in the first file

append "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\second.gdt"

#Store the subset of records

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\housingWOBoston.gdt"

 

#Plot the records with those having average rooms=0.0 omitted

gnuplot CMEDV RM --output=display

 

#Look at your data!  You can use descriptive stats to do this

summary

 

#Create a small dataset for your Paper and Pencil Assignment this week.  Keep in mind

#what a small sample size is! 

 

smpl 30 --random

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\smallHousingSample.csv"

smpl --full

 

#Now that you have all the data sorted out continue with your gretl assignment
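If you would rather not write the intermediate files, the same subset can probably be produced in one pass with a sample restriction.  The following is only a sketch, not the method used above: it assumes the observation ranges from the script (1 to 356 and 489 to 506), and the file names are shown without a folder, so you would add your own path.  Note that genr index adds a helper series named index to the dataset, and it will be saved along with the other variables.

```hansl
# Sketch: keep observations 1-356 and 489-506 without first.gdt/second.gdt
# (file names without a folder are placeholders; prefix your own path)
open "correctHousing.gdt"

# Create a series holding the observation number 1, 2, ..., 506
genr index

# Keep only the records outside the Boston block
smpl index <= 356 || index >= 489 --restrict

# Save the restricted sample; it should contain 356 + 18 = 374 records
store "housingWOBoston.gdt"
```

The count of 374 matches the number of observations reported in the OLS output later in this document.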

 

 

You will continue to use the gretl dataset you used for the Midterm Examination to answer the following questions.

 

You are interested in the relationship between the average number of rooms in owner-occupied homes (RM) and the (corrected) median home value (CMEDV) from the Boston housing dataset we used for the Midterm Examination.  All answers to this gretl assignment involve your analysis of that dataset. 

 

1.     Which of the variables of interest, RM and CMEDV, are the following?

a.     The independent variable 

b.     The dependent variable 

 

2.     Once you have pre-processed the data (e.g. as above between the lines) so that you have a good set of observations in your dataset, produce a scatterplot that shows how home values change with increasing number of rooms in the home. 

 

gnuplot CMEDV RM --output=display     

 

 

The relationship between the variables CMEDV and RM appears to be linear.   

 

 

3.     Calculate descriptive statistics for the variables of interest (CMEDV and RM).

 

summary RM CMEDV     

 

What is the mean value of CMEDV? 

What is the mean value of RM? 

What is the interquartile range of CMEDV? 

What is the standard deviation of RM? 
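If you want the four values as named scalars instead of reading them off the summary table, a minimal sketch follows.  It assumes the housing dataset with CMEDV and RM is already open; the scalar names are my own, and the quantile-based interquartile range may differ very slightly from the one summary prints if the two use different interpolation rules.

```hansl
# Assumes the housing dataset (with CMEDV and RM) is already open
scalar mean_cmedv = mean(CMEDV)
scalar mean_rm = mean(RM)

# Interquartile range = 75th percentile minus 25th percentile
scalar iqr_cmedv = quantile(CMEDV, 0.75) - quantile(CMEDV, 0.25)

# Sample standard deviation of the number of rooms
scalar sd_rm = sd(RM)

print mean_cmedv mean_rm iqr_cmedv sd_rm
```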

 

4.     Calculate the correlation coefficient between the variables of interest.  

 

corr CMEDV RM

 

What is the value you got for the correlation coefficient? 

The correlation coefficient is a measure of the strength of the relationship between two variables.     True or False? 

 

5.     Residuals are the difference between the values of independent variables at different points in time.    True or False? 

 

6.     The least squares method of regression to find the line best fitting the data minimizes the (select the best answer below):

a.     Sum of squared residuals    

b.     Sum of residuals

c.      Sum of the dependent variable squared

d.     Sum of the difference of independent variables squared.

 

7.     What are the assumptions required for conducting a linear regression?  Select all that apply. 

a.     Linearity, i.e. the relationship between dependent variable and independent variable(s) is linear.     

b.     Homoscedasticity, i.e. the residuals have roughly constant variance and are scattered about zero.    

c.      Independence, i.e. observations are independent.       

d.     Normality, i.e. the residuals are normally distributed.       

e.     The correlation coefficient between variables equals zero.    

f.       The standard deviations of the dependent variable vary over time resulting in heteroscedasticity. 

g.     At least one of the independent variables depends on other independent variables.

h.     The number of observations is small and the data follow a Student’s t-distribution.

i.       There are a number of explainable outliers in the data resulting in heteroscedasticity.    

 

 

8.     Estimate a simple linear regression model by least squares using the ols command in gretl.

 

Let’s take a look at the output from the OLS command:

 

ols CMEDV 0 RM

Don’t forget to include the “0” in the command for ordinary least squares computation because that is what gives you the value for the intercept.   

 

 

 

Model 1: OLS, using observations 1-374
Dependent variable: CMEDV

             coefficient   std. error   t-ratio    p-value
  ---------------------------------------------------------
  const       −26.0455      2.23293     −11.66    5.26e-027 ***
  RM            7.99551     0.349031     22.91    4.52e-073 ***

Mean dependent var   24.70294   S.D. dependent var   8.387353
Sum squared resid    10884.86   S.E. of regression   5.409286
R-squared            0.585176   Adjusted R-squared   0.584061
F(1, 372)            524.7651   P-value(F)           4.52e-73
Log-likelihood      −1161.036   Akaike criterion     2326.072
Schwarz criterion    2333.921   Hannan-Quinn         2329.189
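Two lines of this output can be checked by hand from the residuals.  The sketch below assumes the housing dataset is open and uses names of my own (res, ssr, ser).  It rebuilds the residuals from the fitted line, sums their squares, and divides by the degrees of freedom.  With the numbers above, sqrt(10884.86 / 372) is about 5.409, which agrees with the reported S.E. of regression.

```hansl
# Assumes the housing dataset is open
ols CMEDV 0 RM --quiet

# Residual = observed CMEDV minus the value fitted by the regression line
series res = CMEDV - ($coeff(const) + $coeff(RM)*RM)

# Sum of squared residuals ("Sum squared resid" in the output)
scalar ssr = sum(res^2)

# Standard error of the regression = sqrt(SSR / degrees of freedom)
scalar ser = sqrt(ssr / $df)

printf "SSR = %g (gretl reports %g)\n", ssr, $ess
printf "S.E. of regression = %g (gretl reports %g)\n", ser, $sigma
```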

 

 


 

Just for comparison’s sake, here is the output from Excel for the same data:

 

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.765121883
R Square            0.585411497
Adjusted R Square   0.584294007
Standard Error      5.414980483
Observations        373

ANOVA
             df    SS            MS            F             Significance F
Regression   1     15360.72425   15360.72425   523.8632123   6.31553E-73
Residual     371   10878.46706   29.32201364
Total        372   26239.19131

            Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%     Lower 95.0%    Upper 95.0%
Intercept   -26.05394429   2.235350016      -11.65542045   5.80305E-27   -30.44948918   -21.6583994   -30.44948918   -21.6583994
6.575       7.997913702    0.349436095      22.88805829    6.31553E-73   7.310789973    8.685037431   7.310789973    8.685037431


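The results from Excel are virtually identical to those from gretl.  The small differences arise because Excel used 373 observations while gretl used 374; the label 6.575 on Excel’s slope row suggests that the first data row was read as a header. 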

 

 

 


a.    The estimated regression equation is   .  True or False?      

 

b.     Discuss the estimated slope coefficient. 

 

                      i.     The value of the estimated slope coefficient is _______.      

                     ii.     Interpret the estimated slope coefficient by considering the following statement. 

                                                        

The estimated slope coefficient tells you how much the dependent variable, in this case home value, varies with changes in the independent variable, i.e. the average number of rooms in owner-occupied homes.    True or False?      

 

c.      Is the estimated slope coefficient statistically significant? How do you know?

 

                      i.     yes

 

                     ii.     In this case, the p-value equals 4.52e-073 (flagged *** by gretl), which is far smaller than the designated 0.05 level of significance.       
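To see what the slope coefficient means in practice, you can compare two predictions that differ by exactly one room.  This is a small sketch assuming the housing dataset is open; the scalar names (pred5, pred6, change) are my own, and the 5-room case is the same example used in the complete script at the end of this document.

```hansl
# Assumes the housing dataset is open
ols CMEDV 0 RM --quiet

# Predicted median home value for 5 rooms and for 6 rooms
scalar pred5 = $coeff(const) + $coeff(RM)*5
scalar pred6 = $coeff(const) + $coeff(RM)*6

# The change for one additional room equals the slope coefficient
scalar change = pred6 - pred5

printf "5 rooms: %.4f, 6 rooms: %.4f\n", pred5, pred6
printf "change = %.5f, slope = %.5f\n", change, $coeff(RM)
```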

 

9.     The coefficient of determination, or r-squared, is a measure of how much of the variability in the response, i.e. the dependent variable, is explained by the model.       

10.  The value of r-squared for our current model is _________?       

11.  At what point or value is the coefficient of determination or r-squared considered a strong indicator?

a.     0.50

b.     0.70

c.      0.90

d.     It depends…     

The truth is that a good value for r-squared depends on what the model you are developing is intended to do.  For many engineering or technical applications, somewhere between 0.50 and 0.70 is usually considered good.  However, if you are developing a model for a final consumer product where safety is involved, you will want a much higher r-squared, e.g. 0.90 or even a lot higher than that.  Most basic R&D projects are fine with an r-squared of around 0.2; there, r-squared only needs to give enough confidence to refine something to the next step or phase, which should have a higher r-squared.  In the social sciences, an r-squared from 0.10 to 0.30 is often considered good.  So, it depends. 
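In a simple regression with one independent variable, r-squared is just the square of the correlation coefficient from question 4.  The sketch below assumes the housing dataset is open and uses scalar names of my own; it is a quick consistency check, not a required step of the assignment.

```hansl
# Assumes the housing dataset is open
ols CMEDV 0 RM --quiet

# R-squared as stored by gretl for the last model
scalar r2 = $rsq

# Correlation coefficient between the two variables of interest
scalar r = corr(CMEDV, RM)

# For one regressor plus a constant, these two numbers should agree
printf "R-squared = %.6f, r squared = %.6f\n", r2, r^2
```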

12.  Calculate a 95% confidence interval for the estimated slope coefficient.

 

First, be sure to keep the result you want clearly in mind, i.e. you want the 95% confidence interval for the slope coefficient.  We have almost all the information required to obtain the upper and lower bounds, and hence the confidence interval, from the OLS output.  In gretl, the slope coefficient is given by $coeff(RM).  The standard error of the slope coefficient is given by $stderr(RM).  The degrees of freedom are given by $df. 

 

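Using an online calculator, or other appropriate means, you can get the t value for the confidence interval once you know the degrees of freedom.  For a 95% confidence interval with the 372 degrees of freedom in our model, t is approximately 1.966; a much smaller sample gives a larger value, e.g. 2.262 for 9 degrees of freedom. 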

 

We get each bound by starting from the estimated slope coefficient and moving one margin of error away from it, where the margin of error is the critical t value times the standard error of the coefficient: subtract it for the lower bound and add it for the upper bound.  The basic format of the equation for the lower bound is:

 

Lower bound = $coeff(RM) - t-critical * $stderr(RM)

 

To get the critical value use gretl’s built-in “critical” function.  For more info on that see the gretl function reference available through the help menu.  The equations for the lower and upper bounds are:

 

scalar lb = $coeff(RM) - critical(t, $df, 0.025)*$stderr(RM)

scalar ub = $coeff(RM) + critical(t, $df, 0.025)*$stderr(RM)

 

print lb ub

 

a.     The value of the lower bound for the 95% confidence interval of the slope coefficient is _________?   

b.     The value of the upper bound for the 95% confidence interval of the slope coefficient is _________?     
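If you want to see the pieces behind lb and ub, the sketch below prints the critical value and the margin of error separately.  It assumes the housing dataset is open; the scalar names are my own.  The inverse-CDF line is just a cross-check: a right-tail probability of 0.025 corresponds to a cumulative probability of 0.975, so both calls should return the same number (about 1.966 for 372 degrees of freedom).

```hansl
# Assumes the housing dataset is open
ols CMEDV 0 RM --quiet

# Right-tail probability 0.025 leaves 95% in the middle of the t distribution
scalar tcrit = critical(t, $df, 0.025)

# Same value via the inverse CDF at cumulative probability 0.975
scalar tcrit_check = invcdf(t, $df, 0.975)

# Margin of error = critical t value times the standard error of the slope
scalar margin = tcrit * $stderr(RM)

printf "t critical = %.4f (inverse CDF check: %.4f)\n", tcrit, tcrit_check
printf "margin of error = %.4f\n", margin
```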

 

 


 

The complete gretl Script for the Module 3 Week 5 gretl assignment. 

 

Note: for this to work, you need to replace my paths with the complete path to the folder or directory you open data files from and save your gretl data files, output, etc. to.  The paths shown read the data file from my Midterm Exam folder and save the newly processed data files to Module 3\Week 5\...

 

 

 

#Open the entire corrected Boston housing dataset

open "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Midterm\correctHousing.gdt"

 

#Sample and store the records before those from Boston itself

smpl 1 356

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\first.gdt"

 

#Restore the full dataset to get the remaining records

smpl --full

#Sample and store the remaining records

smpl 489 506

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\second.gdt"

 

#Open the file with the first set of records

open "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\first.gdt"

#Append the records in the second file to those in the first file

append "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\second.gdt"

#Store the subset of records

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\housingWOBoston.gdt"

 

#Look at your data!  You can use descriptive stats to do this

summary

 

#Create a small dataset for your Paper and Pencil Assignment this week.  Keep in mind

#what a small sample size is! 

 

smpl 30 --random

store "I:\My Passport Documents\McDaniel\DataAnalytics\ANA500\Module3\Week5\smallHousingSample.csv"

smpl --full

 

#Now that you have all the data sorted out continue with your gretl assignment

#

#

 

#Plot the records with those having average rooms=0.0 omitted

gnuplot CMEDV RM --output=display

 

#

#

#

 

#Compute the descriptive stats for just the variables of interest

summary CMEDV RM 

 

 

#Compute the correlation coefficient for the variables of interest

corr CMEDV RM  

 

#Develop an ordinary least squares model for this data

ols CMEDV 0 RM --vcv

 

 

#Now we can use the corresponding linear regression (line) to predict what a median home value

#would be for any number of rooms.  For example, for 5 rooms we would get:

scalar CMEDV_hat = $coeff(const) + $coeff(RM)*5

 

 

#

#

#Now let's look at the 95% confidence intervals

 

 

scalar lb = $coeff(RM) - critical(t, $df, 0.025)*$stderr(RM)

scalar ub = $coeff(RM) + critical(t, $df, 0.025)*$stderr(RM)

 

print lb ub

 

 
