I’m stuck on a Software Development question and need an explanation.
FINAL DATA MINING PROJECT
A. The Project
This project has 2 parts:
• 1st part is for everyone as your Weekly Assignment.
• 2nd part is optional if you want to make up your midterm score.
You can get up to 20 points added to your exam score.
To make it easier on you, I will allow you to submit it as a group. Groups can have up to 4 people,
but you can submit it alone too. Just let me know your groups in advance.
You can use Decision Trees, which we cover, or more sophisticated methods like Random
Forests or Logistic Regression. (These two algorithms might generate better results.)
• Part 1: You will develop a Classification/Prediction model that predicts whether a
patient has a disease or is healthy.
• Part 2: You will do the same for individual diseases: whether or not a patient has that particular disease.
1. You are provided with an Excel file that has 11 Worksheets.
2. The first one (Hormones_Diseases) has the Hormones vs Diseases table. This table lists
all hormonal measurements of babies (patients) and their relationship (correlation
level) with each disease. These values come from doctors (endocrinologists) who are
experts on these diseases but who also need help to diagnose patients better.
So, use these correlation or relationship values as starting points or as
weights for your attributes (hormones), but do not completely ignore the other hormones
that show no relationship. After all, the data might suggest that some of those blank ones
are good predictors of whether or not a genetic disease exists.
3. The Up and Down arrows here show the direction and strength of the relationship
(correlation) between diseases and hormones according to the doctors. For example,
↓↓↓↓ means this hormone is strongly but negatively correlated with this disease,
whereas ↑ means they are mildly but positively correlated. The doctors do not consider the
blank ones correlated, but they still think they are important. We already removed 3
hormones which they deemed of no importance at all. Again, please do not ignore the
hormones that are blank. Make sure you give more importance to the ones with more
arrows (using weights is one option: positive values for Up and negative values for Down).
4. The other 10 Worksheets contain the measured levels of each hormone
for the patients that have that disease. For example, all the patients in the 2) 21OHD-C Disease
worksheet have the 21OHD-C Disease and these are their measurements. All other
worksheets work the same way.
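The arrow-to-weight idea in item 3 can be sketched in Python. This is only one possible reading, under stated assumptions: the helper names `arrow_weight` and `attribute_weights`, the hormone names, and the `blank_weight` default are all hypothetical, and the assignment does not say how large the weight for a blank cell should be.

```python
# Hypothetical helpers: turn the doctors' arrow strings into signed weights.
UP, DOWN = "↑", "↓"


def arrow_weight(cell):
    """Return +n for n Up arrows, -n for n Down arrows, and 0 for a blank cell."""
    text = "" if cell is None else str(cell).strip()
    return text.count(UP) - text.count(DOWN)


def attribute_weights(arrow_row, blank_weight=0.5):
    """Map {hormone: arrow string} to {hormone: weight}.

    Blank cells receive a small nonzero weight (an assumed value) so that
    hormones without arrows are down-weighted rather than ignored.
    """
    weights = {}
    for hormone, cell in arrow_row.items():
        weight = arrow_weight(cell)
        weights[hormone] = weight if weight != 0 else blank_weight
    return weights


# Example with made-up hormone names, not taken from the Excel file.
example = attribute_weights({"hormone_A": "↓↓↓↓", "hormone_B": "↑", "hormone_C": ""})
```

How these weights are then used (for instance, to scale or rank attributes before training) is a separate design choice that the assignment leaves open.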
C. Data Preparation
1. For Part 1: you will need to create a Dataset where patients are listed in tabular format,
with their information (attributes) as columns and 1 final attribute which is the class. I
created a last worksheet titled Dataset Format to help you get started, but
you can choose your own format. (You can create this dataset either by copy/paste with
Transpose or by writing a small macro if you know VBScript.) (If you are not going to do Part
2, you do not need to have a column called Disease.)
2. For Part 2: If you are going to do Part 2, you can either:
a. Create a Classification Model that predicts the Disease (Disease is the class attribute
and this is a multi-class problem).
b. It would be simpler to develop a Classification Model just like in Part 1, but this
time train and test it with only the data for patients that have this disease plus some Healthy patients.
c. Or you can keep the ones that have the disease as “yes” in the Sick? attribute and change
everyone else to “no”, including patients who have other diseases (meaning that they
do not have that particular genetic disease).
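As an alternative to copy/paste with Transpose, the dataset could be assembled with pandas. The sketch below rests on assumptions the assignment does not confirm: each worksheet is taken to have hormones as rows and patients as columns (hence the transpose), one worksheet is taken to be named `Healthy`, and the function names are hypothetical. `one_vs_rest` illustrates option (c) for Part 2.

```python
import pandas as pd


def build_dataset(sheets, healthy_sheet="Healthy"):
    """Stack per-disease worksheets into one table with one row per patient.

    `sheets` maps worksheet name -> DataFrame, e.g. the dict returned by
    pd.read_excel(path, sheet_name=None, index_col=0) after dropping the
    Hormones_Diseases and Dataset Format sheets.
    """
    frames = []
    for name, sheet in sheets.items():
        patients = sheet.T.copy()      # assumed layout: rows become patients
        patients["Disease"] = name     # only needed for Part 2
        patients["Sick?"] = "no" if name == healthy_sheet else "yes"
        frames.append(patients)
    return pd.concat(frames, ignore_index=True)


def one_vs_rest(dataset, disease):
    """Option (c): 'yes' only for the chosen disease, 'no' for everyone else."""
    relabeled = dataset.copy()
    has_disease = relabeled["Disease"] == disease
    relabeled["Sick?"] = has_disease.map({True: "yes", False: "no"})
    return relabeled
```

For option (b), the same table could instead be filtered to rows whose Disease is either the chosen disease or the healthy label before training.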
D. Project & Submission
1. Part 1: Develop a Classification Model that predicts whether or not a patient has a
disease, improve your model to achieve better prediction accuracy, and submit the following:
a) A report that describes the Classification Algorithm you chose and its parameter
settings, the Accuracy, and the Confusion Matrix.
b) The final code, either as a Jupyter Notebook (preferably) or as .py file(s). (Make sure you
explain any dependencies, since I need to run your code and see the same results.)
c) The dataset you created from Excel file and used in your model training and testing.
2. Part 2: Submit the same, only this time develop a Model that predicts whether the patient has
a certain disease, as explained in section A (The Project) above.
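The Part 1 deliverables (algorithm, parameter settings, Accuracy, Confusion Matrix) could be produced along the following lines with scikit-learn. This is a minimal sketch, not the expected solution: the function name `train_and_evaluate`, the 70/30 split, `max_depth=3`, and the fixed seed are all assumptions chosen for illustration, and the class column is assumed to hold “yes”/“no” labels.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_evaluate(dataset, class_column="Sick?", test_size=0.3, seed=0):
    """Fit a decision tree and report accuracy and the confusion matrix."""
    # Use only numeric attribute columns (the hormone measurements) as features.
    features = dataset.drop(columns=[class_column]).select_dtypes("number")
    labels = dataset[class_column]
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # A fixed random_state keeps results reproducible when the code is rerun.
    model = DecisionTreeClassifier(max_depth=3, random_state=seed)
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    return {
        "model": model,
        "accuracy": accuracy_score(y_test, predicted),
        # Rows are true classes, columns are predicted classes, in this order.
        "confusion_matrix": confusion_matrix(y_test, predicted, labels=["no", "yes"]),
    }
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` or `LogisticRegression` from scikit-learn would leave the rest of the function unchanged, which makes it easy to compare the algorithms mentioned in section A.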