Q1: Read these two data sets into R. Use the variable names of the summary
statistics from the file “crime-data-info.txt” as the 6th to 147th column names of
the data set “CrimeData”. The first five column names of the data set “CrimeData”
are, respectively, “State”, “County”, “CommCode”, “CommName”, and “fold”.
Find the feature variables in “CrimeData” that have missing values and delete the
columns with missing values. Also delete the columns corresponding to “State”,
“County”, “CommCode”, and “fold”. The resulting data set is a complete data set
without missing values (name it “CompleteCrimeData”), which has 101
columns. Among them, the first column is the community name and the last column
is the target variable: the total number of violent crimes per 100K population.
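The column clean-up in Q1 can be sketched in base R as follows. This is only an illustration on a three-row toy table, not the real data: the feature names "feat1", "feat2", and "target" are hypothetical stand-ins for the names in “crime-data-info.txt”, and coding missing values as "?" is an assumption that should be checked against the actual file.

```r
# Toy stand-in for the raw file. In the real exercise the data would be read
# from disk with read.csv() or read.table().
raw_text <- "CA,1,100,Alpha,1,0.5,?,10
NY,?,200,Beta,2,0.7,0.3,20
TX,3,?,Gamma,3,0.9,0.1,30"

# ASSUMPTION: missing values are coded as "?" in the raw file.
CrimeData <- read.csv(text = raw_text, header = FALSE, na.strings = "?",
                      stringsAsFactors = FALSE)

# The first five names are fixed by the assignment; the remaining names are
# hypothetical and would come from "crime-data-info.txt".
names(CrimeData) <- c("State", "County", "CommCode", "CommName", "fold",
                      "feat1", "feat2", "target")

# Drop the four identifier columns, then every column with at least one NA.
drop_ids <- c("State", "County", "CommCode", "fold")
kept <- CrimeData[, !(names(CrimeData) %in% drop_ids)]
CompleteCrimeData <- kept[, colSums(is.na(kept)) == 0]

names(CompleteCrimeData)
```

On the real data, dim(CompleteCrimeData) gives a quick check that 101 columns remain.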

Q2: Use the first 1500 rows (communities) as the training data set and the last
494 rows as the test data set. Fit a linear regression using all 99 feature
variables with the target variable as the response, and estimate the coefficients in
the linear regression model using Least Angle Regression (LAR) for a sequence of
tuning parameters. Plot the solution paths of all the LAR coefficient estimators.
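One possible way to fit and plot a LAR path is the lars package. The sketch below assumes that package is installed and uses simulated data in place of “CompleteCrimeData”; the sizes (200 rows, 8 features, 150 training rows) and the true coefficients are arbitrary choices made for illustration.

```r
library(lars)  # ASSUMPTION: the lars package is installed

# Simulated stand-in for the crime data (not the real data set).
set.seed(1)
n <- 200
p <- 8
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)
beta <- c(3, -2, 0, 0, 1.5, 0, 0, 0)
y <- as.vector(x %*% beta + rnorm(n))

# Use the first rows for training, as in the assignment's split.
train <- 1:150

# type = "lar" requests Least Angle Regression.
lar_fit <- lars(x[train, ], y[train], type = "lar")

# Solution paths of all coefficients along the LAR steps.
plot(lar_fit)

# Coefficient matrix: one row per point on the path, one column per feature.
lar_coef <- coef(lar_fit)
dim(lar_coef)
```

For the real data, x would hold the 99 feature columns of the first 1500 rows and y the target column.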

Q3: Based on the LAR estimator in Q2, if one would like to obtain a LASSO
estimate of coefficients in the above linear regression, could you specify the
smallest tuning parameter that would make the LAR estimator and LASSO
estimator different?

Q4: Compare the LAR estimator with the LASSO estimator via the plots of the
entire solution paths, and identify the portion of the solution paths where these
two estimators are the same. Which feature variable has different solution paths
at the tuning parameter you found in Q3? In terms of computational complexity,
how many more steps does the LASSO estimator take than the LAR estimator?
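The comparison in Q3 and Q4 rests on the fact that the LASSO path follows the LAR path until a coefficient hits zero and its variable is dropped from the active set. A minimal sketch with the lars package, on simulated correlated data, is given below. It relies on the `actions` component of the fitted object, where negative indices mark dropped variables. Whether a drop occurs at all depends on the data, so the code handles both cases.

```r
library(lars)  # ASSUMPTION: the lars package is installed

# Simulated data with correlated columns (not the real crime data).
set.seed(2)
n <- 100
p <- 6
z <- rnorm(n)
x <- sapply(1:p, function(j) z + rnorm(n))
colnames(x) <- paste0("x", 1:p)
y <- as.vector(x %*% c(2, -2, 1, 0, 0, -1) + rnorm(n))

lar_fit   <- lars(x, y, type = "lar")
lasso_fit <- lars(x, y, type = "lasso")

# Side-by-side solution paths for visual comparison.
par(mfrow = c(1, 2))
plot(lar_fit)
plot(lasso_fit)

# A LASSO step that drops a variable has a negative index in its action.
is_drop <- sapply(lasso_fit$actions, function(a) any(a < 0))
first_drop <- if (any(is_drop)) which(is_drop)[1] else NA

if (!is.na(first_drop)) {
  a <- lasso_fit$actions[[first_drop]]
  cat("First drop at LASSO step", first_drop,
      "for variable", colnames(x)[abs(a[a < 0])], "\n")
} else {
  cat("No variable was dropped: the two paths coincide on this data\n")
}

# Extra steps taken by LASSO relative to LAR.
extra_steps <- length(lasso_fit$actions) - length(lar_fit$actions)
extra_steps
```

Before the first drop step the two coefficient paths are identical, which is the portion Q4 asks for.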

Q5: The LASSO estimator depends on the tuning parameter. Different tuning
parameters produce different estimators with different numbers of nonzero
coefficients. Use cross-validation to choose the tuning parameter for the
LASSO estimator and identify the tuning parameter that minimizes the
mean squared error of the predictions.

Q6: Based on the tuning parameter chosen in Q5, predict the target variable using
the test data set given in Q2. Find the sum of squared errors (SSE) of the
prediction using the LASSO estimator.
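Q5 and Q6 together could be approached with cv.lars and predict from the lars package, as sketched below on simulated data. The choice of ten folds and of the "fraction" scale for the tuning parameter (the L1 norm of the coefficients relative to its maximum) are assumptions; Q5 does not fix either.

```r
library(lars)  # ASSUMPTION: the lars package is installed

# Simulated stand-in for the crime data.
set.seed(3)
n <- 200
p <- 8
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)
y <- as.vector(x %*% c(3, -2, 0, 0, 1.5, 0, 0, 0) + rnorm(n))
train <- 1:150
test  <- 151:200

# Cross-validated mean squared error over a grid of fractions in [0, 1].
cv_out <- cv.lars(x[train, ], y[train], K = 10, type = "lasso",
                  mode = "fraction", plot.it = FALSE)
best_frac <- cv_out$index[which.min(cv_out$cv)]

# Refit on the full training set and predict the test rows at best_frac.
lasso_fit <- lars(x[train, ], y[train], type = "lasso")
pred <- predict(lasso_fit, newx = x[test, ], s = best_frac,
                type = "fit", mode = "fraction")$fit

# Sum of squared prediction errors on the test set.
sse_lasso <- sum((y[test] - pred)^2)
sse_lasso
```

The fold assignment is random, so set.seed is needed for the chosen fraction to be reproducible.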

Q7: Use the first 1500 rows (communities) as the training data set and the last
494 rows as the test data set. Fit a linear regression using the feature variables
with non-zero coefficients selected by LASSO in Q6 with the target variable as the
response, and estimate the coefficients in the linear regression model using the ridge
estimator for a sequence of tuning parameters. Plot the solution paths of all the
ridge coefficient estimators.

Q8: Apply ten-fold cross-validation to the training data set in Q7 and
find the tuning parameter that minimizes the prediction error. Using the tuning
parameter chosen by the ten-fold cross-validation, predict the target variable in
the test data set and evaluate the SSE of the prediction errors. Compare the SSE
given by LASSO and the SSE given by the ridge regression.
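For Q7 and Q8, one option is the glmnet package, where alpha = 0 selects the ridge penalty. The sketch below assumes glmnet is installed and again uses simulated data. The vector `selected` is a hypothetical placeholder for the indices of the features with non-zero LASSO coefficients.

```r
library(glmnet)  # ASSUMPTION: the glmnet package is installed

# Simulated stand-in for the crime data.
set.seed(4)
n <- 200
p <- 8
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)
y <- as.vector(x %*% c(3, -2, 0, 0, 1.5, 0, 0, 0) + rnorm(n))
train <- 1:150
test  <- 151:200

# HYPOTHETICAL: indices of the features kept by the LASSO fit.
selected <- c(1, 2, 5)

# Ridge fit over glmnet's default sequence of tuning parameters.
ridge_fit <- glmnet(x[train, selected], y[train], alpha = 0)

# Solution paths of the ridge coefficients against log(lambda).
plot(ridge_fit, xvar = "lambda", label = TRUE)

# Ten-fold cross-validation on the training set.
cv_ridge <- cv.glmnet(x[train, selected], y[train], alpha = 0, nfolds = 10)
best_lambda <- cv_ridge$lambda.min

# Predict the test rows at the selected lambda and compute the SSE.
pred <- as.vector(predict(cv_ridge, newx = x[test, selected],
                          s = "lambda.min"))
sse_ridge <- sum((y[test] - pred)^2)
sse_ridge
```

The resulting SSE can then be set next to the LASSO SSE from Q6, computed on the same test rows.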

We will apply ridge estimation and LASSO methods to a crime data set available at
http://archive.ics.uci.edu/ml/datasets/communities+and+crime. The data set
contains socio-economic data from three sources: the 1990 US Census, law
enforcement data from the 1990 US LEMAS survey, and crime data from the 1995
FBI UCR. There are 127 attributes including 122 predictive feature variables and 5
non-predictive attributes in the data set. These 122 attributes are considered to
be related to crime. There are two types of feature variables included in the data
set:

1. Community-related survey data: such as the percent of the population considered urban, and the median family income;

2. Law enforcement data: such as per capita number of police officers, and percent of officers assigned to drug units.