Intro to Data Mining (ITS 632)

I’m stuck on a Software Development question and need an explanation.

ITS 632


A. The Project

This project has 2 parts:

• 1st part is for everyone as your Weekly Assignment.

• 2nd part is optional if you want to make up your midterm score.

You can get up to 20 points added to your exam score.

To make it easier on you I will allow you to submit it as a group. The groups can be up to 4 people.

But you can submit it alone too. Just let me know your groups in advance.

You can use Decision Trees, which we cover, or other more sophisticated methods like Random Forests or Logistic Regression. (These two algorithms might generate better results.)

• Part 1: You will develop a Classification/Prediction model that predicts whether a patient has a disease or is healthy.

• Part 2: You will do the same for individual diseases: predict whether or not a patient has that specific disease.
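
As a rough illustration of the Part 1 setup, the sketch below trains a Decision Tree with scikit-learn. It is not the assignment's solution: the data is synthetic, the two hormone features are invented, and `max_depth=3` is an arbitrary choice. Fixing `random_state` is one way to make the results repeatable on another machine.

```python
import random

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): two made-up hormone levels per patient.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    sick = rng.random() < 0.5
    hormone_a = rng.gauss(5.0 if sick else 2.0, 1.0)  # differs between classes
    hormone_b = rng.gauss(1.0, 1.0)                   # uninformative on purpose
    X.append([hormone_a, hormone_b])
    y.append("yes" if sick else "no")

# Hold out 30% of the patients for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A shallow tree; the depth is a parameter setting worth tuning and reporting.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# For a classifier, score() returns the accuracy on the given data.
print("Test accuracy:", model.score(X_test, y_test))
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` or `LogisticRegression` keeps the same `fit`/`predict` pattern.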

B. Data

1. You are provided with an Excel file that has 11 Worksheets.

2. The first one (Hormones_Diseases) has the Hormones vs Diseases table. This table lists all Hormonal Measurement values of babies (patients) and their relationship (correlation level) with each disease. These come from doctors (Endocrinologists) who are experts on these diseases, but they are also the ones who need help to better diagnose the patients. So, use these correlation or relationship values as starting points or for giving weights to your attributes (hormones), but do not completely ignore the other hormones that show no relationship. After all, the data might suggest that some of those blank ones are good predictors of whether or not a genetic disease exists.

3. The Up and Down arrows here show the direction and strength of the relationship (correlation) between diseases and hormones according to the doctors. For example, ↓↓↓↓ means this hormone is strongly but negatively correlated with this disease, whereas ↑ means they are mildly but positively correlated. Blank ones are not considered correlated by the doctors, but they also think they are important. We already removed 3 hormones which they deemed of no importance at all. Again, please do not ignore the hormones that are blank. Make sure you give more importance to the ones with more arrows (using weights is one option: a positive value for Up and a negative one for Down).

4. The other 10 Worksheets contain the measurements of each hormone's level for the patients who have that disease. For example, all the patients in the 2) 21OHD-C Disease worksheet have the 21OHD-C Disease and these are their measurements. All other worksheets work the same way.
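
The arrow notation from item 3 could be turned into signed weights along these lines. This is a minimal sketch: the helper name `arrow_to_weight` and the hormone names are hypothetical, and counting one point per arrow is just one possible weighting scheme.

```python
# Hypothetical helper: turns the doctors' arrow notation into a signed weight.
def arrow_to_weight(cell):
    """Return +n for n Up arrows, -n for n Down arrows, and 0 for a blank cell."""
    text = (cell or "").strip()
    return text.count("↑") - text.count("↓")


# Made-up example row, not taken from the real Hormones_Diseases worksheet.
example_row = {"Hormone A": "↓↓↓↓", "Hormone B": "↑", "Hormone C": ""}
weights = {hormone: arrow_to_weight(cell) for hormone, cell in example_row.items()}
print(weights)  # {'Hormone A': -4, 'Hormone B': 1, 'Hormone C': 0}
```

A weight of 0 here only records that the doctors marked no correlation. Such a hormone should still stay in the dataset as an attribute, since the data may show it to be a useful predictor.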

C. Data Preparation

1. For Part 1: You will need to create a Dataset where Patients are listed in tabular format, with their information (attributes) as columns and 1 final attribute which is the class. I created a last worksheet titled Dataset Format to help you get started, but you can choose your own format. (You can either create this dataset by copy/paste with Transpose or write a small Macro if you know VBScript.) (If you are not going to do Part 2, you do not need to have a column called Disease.)

2. For Part 2: If you are going to do Part 2, you can either:

a. Create a Classification Model that predicts the Disease (Disease is the class attribute and this is a multi-class problem).

b. It would be simpler to develop a Classification Model just like in Part 1, but this time train and test it with only the data for patients who have this disease plus some Healthy patients.

c. Or you can keep the ones that have the disease as “yes” in the Sick? attribute and change everyone else to “no”, including patients who have other diseases (meaning that they do not have that particular genetic disease).
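
The data preparation steps above can be sketched in plain Python. This is only an illustration under assumptions: it supposes each worksheet stores one hormone per row (name first, then one value per patient), which the Transpose hint suggests but which is not confirmed here. The helper names, the hormone names, the "Other disease" label, and all numbers are made up; only the `Disease` and `Sick?` column names and the 21OHD-C disease come from the text.

```python
def sheet_to_patient_rows(sheet_rows, disease, sick):
    """Transpose hormone-per-row data into one dict per patient (assumed layout)."""
    hormone_names = [row[0] for row in sheet_rows]
    # zip(*...) flips rows and columns: each tuple holds one patient's values.
    per_patient_values = zip(*(row[1:] for row in sheet_rows))
    patients = []
    for values in per_patient_values:
        record = dict(zip(hormone_names, values))
        record["Disease"] = disease  # only needed for Part 2
        record["Sick?"] = sick       # class attribute for Part 1
        patients.append(record)
    return patients


def relabel_for_disease(rows, target_disease):
    """Option c: 'yes' only for the target disease, 'no' for everyone else."""
    relabeled = []
    for row in rows:
        new_row = dict(row)  # copy so the Part 1 dataset stays unchanged
        new_row["Sick?"] = "yes" if row["Disease"] == target_disease else "no"
        relabeled.append(new_row)
    return relabeled


# Made-up measurements standing in for three worksheets.
dataset = (
    sheet_to_patient_rows([["Hormone A", 1.2, 0.9], ["Hormone B", 30.0, 41.5]],
                          "21OHD-C", "yes")
    + sheet_to_patient_rows([["Hormone A", 3.3], ["Hormone B", 8.0]],
                            "Other disease", "yes")
    + sheet_to_patient_rows([["Hormone A", 2.1], ["Hormone B", 12.0]],
                            "Healthy", "no")
)
part2_dataset = relabel_for_disease(dataset, "21OHD-C")
for row in part2_dataset:
    print(row)
```

For option b, the same `dataset` could instead be filtered to rows whose `Disease` is either the target disease or Healthy before training.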

D. Project & Submission

1. Part 1: Develop a Classification Model that predicts whether or not a patient has a disease, improve your model to achieve better prediction accuracy, and submit the following:

a) A report that describes the Classification Algorithm you chose and its Parameter settings, the Accuracy, and the Confusion Matrix

b) The final code, either as a Jupyter Notebook (preferably) or .py file(s). (Make sure you explain any dependencies, since I need to run your code and see the same results.)

c) The dataset you created from the Excel file and used in your model training and testing.

2. Part 2: Submit the same, only this time develop a Model that predicts whether the patient has a certain disease, as explained in section A. The Project above.
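
The Accuracy and Confusion Matrix requested for the report can be computed by hand as a sanity check on whatever library you use. The sketch below assumes a two-class problem with “yes” as the positive label; the function names and the example labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Accuracy = correct predictions / all predictions."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


# Made-up labels, not results from the real dataset.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
print("Accuracy:", accuracy(counts))
```

Reporting the four counts alongside the accuracy shows whether the errors are mostly missed diseases (FN) or false alarms (FP), which a single accuracy number hides.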
