Tutorial by Amy Cruz & Samuel Howard
Millions of people in the United States suffer from Dementia or Alzheimer's. Although Dementia and Alzheimer's are often confused with one another, they are separate conditions: Dementia is an umbrella term for a collection of symptoms that inhibit thinking ability, while Alzheimer's is a degenerative brain disease that often leads to Dementia. Alzheimer's disease can begin 20 years before symptoms appear, and therapy has been found to improve the quality of life for those with the disease. For these reasons, predicting both Dementia and Alzheimer's is valuable in reducing the suffering caused by these conditions.
It is common knowledge that Dementia and Alzheimer's typically affect the elderly, but genetics are another major risk factor for both conditions. In this tutorial, we will investigate whether other factors could be used to reliably predict Dementia and Alzheimer's.
(Information Sourced from https://www.alz.org/media/Documents/alzheimers-facts-and-figures.pdf)
Fortunately for us, data has been collected relating to Dementia and Alzheimer's. There are several publicly available datasets that could be used to assess how Dementia and Alzheimer's can be predicted. We chose to investigate MRI data from the Open Access Series of Imaging Studies (OASIS) project. This dataset has a good combination of simple attributes (sex, age, etc.) and attributes only measurable via an MRI.
Another dataset we considered was Questionnaire Response Data from the CDC. This data would be good for further analysis, but we found it more difficult to interpret than the OASIS data.
We downloaded the MRI data from Kaggle and obtained the files oasis_longitudinal.csv and oasis_cross-sectional.csv.
Then we used Pandas to read the data in and store it in dataframes:
# Import necessary modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Use Pandas to read the longitudinal data into a dataframe and print the
# first few rows of the dataframe
data = pd.read_csv('oasis_longitudinal.csv', sep = ',')
data.head()
The longitudinal data consists of samples with the following features: Subject ID, Group, Visit, the delay between visits, sex (M/F), handedness (Hand), Age, years of education (EDUC), socioeconomic status (SES), the Mini-Mental State Examination score (MMSE), the Clinical Dementia Rating (CDR), estimated total intracranial volume (eTIV), normalized whole brain volume (nWBV), and the atlas scaling factor (ASF).
# Use Pandas to read the cross-sectional data into a dataframe and print the
# first few rows of the dataframe
data2 = pd.read_csv('oasis_cross-sectional.csv', sep = ',')
data2.head()
The cross-sectional data consists of the same features as the longitudinal data, except that it is missing the Subject ID, Group, and Visit columns.
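As a quick check of that claim (a minimal sketch, assuming data and data2 are the dataframes loaded above), we can compare the two column sets directly:
# Columns present in the longitudinal data but not the cross-sectional data
print(set(data.columns) - set(data2.columns))
# Columns present in the cross-sectional data but not the longitudinal data
print(set(data2.columns) - set(data.columns))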
To start the analysis, we plotted a heatmap of the correlation matrix and a scatter matrix to see whether there were any patterns or strong relationships between the independent variables and the dependent variables listed below.
Dependent variables: MMSE and CDR.
Independent variables: Age, education (EDUC/Educ), SES, eTIV, nWBV, and ASF.
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
# Correlation heatmap and scatter matrix for the longitudinal data
# (numeric_only=True restricts the correlation to numeric columns, which
# recent versions of pandas require to be stated explicitly)
corr = data.corr(numeric_only=True)
sn.heatmap(corr, annot = True)
plt.show()
scatter_matrix(data)
plt.show()
# Correlation heatmap and scatter matrix for the cross-sectional data
corr = data2.corr(numeric_only=True)
sn.heatmap(corr, annot = True)
plt.show()
scatter_matrix(data2)
plt.show()
Initial Analysis:
Cross-Sectional:
Based on the heatmap, the strongest positive correlation between an independent and a dependent variable is 0.34, between MMSE and nWBV. This is a low positive correlation, and the remaining positive correlations are even weaker. The same holds for the negative correlations, the strongest being -0.34 between CDR and nWBV.
One interpretation of the MMSE and nWBV relationship is that participants who score higher on the MMSE tend to have a higher normalized whole brain volume. This is plausible: greater brain volume is associated with better cognitive function and a lower likelihood of dementia, which is consistent with the positive correlation.
A similar interpretation applies to CDR and nWBV. Participants with CDR scores greater than 0 have very mild to moderate Dementia, and participants with higher CDR scores tend to have a smaller nWBV, which is consistent with the negative correlation.
Longitudinal:
Based on the heatmap and the correlation matrix, we can see a trend similar to that of the cross-sectional data. The strongest positive correlation is between MMSE and nWBV, at 0.47, and the strongest negative correlation is between CDR and nWBV, at -0.5. These correlations support the same interpretations as in the cross-sectional data.
Another interesting relationship is between MMSE and Age, which have a relatively strong (for this dataset) negative correlation of -0.25, suggesting that older participants tend to score lower on the MMSE. MMSE and education also have a relatively strong positive correlation of 0.3, suggesting that participants with more years of education tend to score higher on the MMSE.
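The specific values quoted above can be read directly out of the correlation matrix rather than off the heatmap; a minimal sketch for the longitudinal data (depending on your pandas version, the numeric_only argument may or may not be needed):
# Look up the individual correlations quoted above from the longitudinal data
long_corr = data.corr(numeric_only=True)
print(long_corr.loc['MMSE', 'nWBV'])
print(long_corr.loc['CDR', 'nWBV'])
print(long_corr.loc['MMSE', 'Age'])
print(long_corr.loc['MMSE', 'EDUC'])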
Longitudinal regression analysis:
The R2 values reported below measure how much of the variation in the dependent variable is explained by the independent variable(s).
The reported p-value tells us whether the relationship is statistically significant: if it is less than 0.05 (at the 95% confidence level), we treat the relationship between the variables as significant.
The variables used in this analysis were chosen because of the strength of their correlation in the heatmap and the correlation matrix.
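For reference, the R2 and p-values quoted below can also be pulled out of a fitted statsmodels results object programmatically instead of being read from the printed summary. Note as well that sm.OLS does not add an intercept automatically, so the regressions in this tutorial are fit without one; wrapping the predictors in sm.add_constant would fit a model with an intercept (and generally report a different R2). A minimal sketch:
# Fit one example regression and extract the key statistics programmatically
example = data.dropna()
results = sm.OLS(example['MMSE'], example['nWBV']).fit()
print(results.rsquared)   # R-squared of the fit
print(results.pvalues)    # p-value for each regressor
# With an intercept (not used in this tutorial):
# sm.OLS(example['MMSE'], sm.add_constant(example['nWBV'])).fit()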
Summary of the Analysis:
nWBV vs MMSE:
R2 = 0.984
P>|t| = 0.00
The R2 value tells us that 98.4% of the observed variation for MMSE can be explained by nWBV.
The P>|t| indicates that there is a statistically significant relationship between nWBV and MMSE.
SES, EDUC vs MMSE:
SES and EDUC were used together because doing so increased the R2 value; the two variables have a strong negative correlation with each other, which is what suggested combining them.
R2 = 0.975
P>|t| = 0.00
The R2 value tells us that 97.5% of the observed variation for MMSE can be explained by SES and EDUC.
The P>|t| indicates that there is a statistically significant relationship between the predictors (SES and EDUC) and MMSE.
# SLR: simple linear regression of MMSE on nWBV (longitudinal data)
# Drop rows with NaN values, since OLS cannot handle missing data
no_nan_data = data.dropna()
# Independent Variable
X = no_nan_data['nWBV']
# Dependent Variable
y = no_nan_data['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
# MLR: multiple linear regression of MMSE on SES and EDUC (longitudinal data)
# Independent Variable
X = no_nan_data[['SES','EDUC']]
# Dependent Variable
y = no_nan_data['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
Cross-Sectional regression analysis:
The variables that were chosen for this analysis were selected based on the strength of their correlation in the heatmap and the correlation matrix.
Summary of the Analysis:
nWBV vs MMSE:
R2 = 0.988
P>|t| = 0.00
The R2 value tells us that 98.8% of the observed variation for MMSE can be explained by nWBV.
The P>|t| indicates that there is a statistically significant relationship between nWBV and MMSE.
SES, EDUC vs MMSE:
SES and EDUC were used together because doing so increased the R2 value; the two variables have a strong negative correlation with each other, which is what suggested combining them.
R2 = 0.965
P>|t| = 0.00
The R2 value tells us that 96.5% of the observed variation for MMSE can be explained by SES and EDUC.
The P>|t| indicates that there is a statistically significant relationship between the predictors (SES and EDUC) and MMSE.
Age vs MMSE:
R2 = 0.944
P>|t| = 0.00
The R2 value tells us that 94.4% of the observed variation for MMSE can be explained by Age.
The P>|t| indicates that there is a statistically significant relationship between Age and MMSE.
# nWBV vs MMSE (cross-sectional data; note that despite its name,
# no_nan_long holds the cross-sectional dataframe data2)
# Drop the mostly-empty Delay column first so that dropna() does not
# discard nearly every row, then drop the remaining rows with NaN values
no_nan_long = data2.drop(['Delay'], axis=1)
no_nan_long = no_nan_long.dropna()
# Independent Variable
X = no_nan_long["nWBV"]
# Dependent Variable
y = no_nan_long["MMSE"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
# SES, Educ vs MMSE (cross-sectional data)
# Independent Variable
X = no_nan_long[['SES','Educ']]
# Dependent Variable
y = no_nan_long["MMSE"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
# Age vs MMSE (cross-sectional data)
# Independent Variable
X = no_nan_long["Age"]
# Dependent Variable
y = no_nan_long["MMSE"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
In order to understand the data better, we want to perform some cleaning. Overall, the goals of this cleaning are to make the cross-sectional dataset more easily comparable to the longitudinal dataset, remove unnecessary columns, separate compound data from existing columns into their own columns, and reorganize the dataframe so that it is easier to assess visually.
The cross-sectional dataset is missing the Subject ID and Visit columns, but this information is still encoded in the ID column, so we will pull those features out of the ID and place them in their own columns (matching how the longitudinal dataset stores this information).
After this cleaning, the Visit and Delay columns of the first-visit data each consist of a single value, so they are not useful for analysis and we remove them. The samples from second visits are not the focus of the cross-sectional data, so we move them into a separate dataframe.
This cleaning is performed below:
# Make a duplicate of the cross-sectional data
cs_data = data2.copy()
# Every subject in the study was right handed, so we do not need to keep that
# information in the dataframe. Remove it using 'drop'.
cs_data = cs_data.drop(['Hand'], axis=1)
# The longitudinal data conveniently has columns for both subject ID and
# visit number. Both of these features are visible in the ID, but let's
# make separate columns for them for convenience and consistency.
for index, row in cs_data.iterrows():
    cs_data.at[index, "Subject ID"] = row["ID"][:-4]
    cs_data.at[index, "Visit"] = int(row["ID"][-1])
# Reorder the columns so that subject ID and Visit appear after ID
reordered_columns = ['ID', 'Subject ID', 'Visit', 'M/F', 'Age', 'Educ',
'SES', 'MMSE', 'CDR', 'eTIV', 'nWBV', 'ASF', 'Delay']
cs_data = cs_data[reordered_columns]
# Part of the dataset consists of 20 individuals who were nondemented and
# imaged a second time. For now, let's isolate this data. We can do this
# easily thanks to the Visit feature we engineered
cs_reliability_data = cs_data[cs_data["Visit"] == 2]
# Now the primary data is the rest of the data, consisting of initial visits
cs_primary_data = cs_data[cs_data["Visit"] == 1]
# Now the primary data set is entirely made of initial visits, so we
# do not need the Visit column. The Delay column is also now entirely
# NaN, so we can use dropna to drop it (and any other column that
# happens to be all NaN; there are no others for now)
cs_primary_data = cs_primary_data.dropna(axis=1, how='all')
cs_primary_data = cs_primary_data.drop(['Visit'], axis=1)
cs_primary_data
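As an aside, the Subject ID and Visit extraction in the loop above can also be written without an explicit loop using pandas string accessors; a minimal sketch of the same transformation:
# Vectorized equivalent of the iterrows() loop above
cs_data["Subject ID"] = cs_data["ID"].str[:-4]
cs_data["Visit"] = cs_data["ID"].str[-1].astype(int)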
The longitudinal data was a lot closer to what we wanted than the cross-sectional data was before cleaning. However, we wanted to encode the Group column into separate "dummy" columns: each unique value in the Group column gets a corresponding column of 1s and 0s indicating whether a row has that value. This is done to allow for easier analysis later on.
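To make the encoding concrete, here is what pd.get_dummies produces on a tiny toy series (purely illustrative, not part of the tutorial data); each unique value becomes its own indicator column:
# Toy example of dummy (one-hot) encoding
toy = pd.Series(['Demented', 'Nondemented', 'Converted', 'Demented'])
print(pd.get_dummies(toy, dtype=int))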
# Duplicate the longitudinal data
long_data = data.copy()
# Every subject in the study was right handed, so we do not need to keep that
# information in the dataframe. Remove it using 'drop'.
long_data.drop(['Hand'], axis=1, inplace=True)
# Create dummy features for 'Group' and add them to the dataframe
# (dtype=int ensures 0/1 integer columns rather than booleans)
dummies = pd.get_dummies(long_data['Group'], dtype=int)
long_data['Converted'] = dummies['Converted']
long_data['Nondemented'] = dummies['Nondemented']
long_data['Demented'] = dummies['Demented']
long_data
This analysis was conducted to see whether the correlations between the variables changed once the data was cleaned. Also, given the new variables (Converted, Nondemented, and Demented), we wanted to see whether there were stronger correlations between these new dependent variables and the independent variables.
# Check the relationships again once the data has been cleaned
print('cross data')
corr = cs_primary_data.corr(numeric_only=True)
sn.heatmap(corr, annot = True)
plt.show()
scatter_matrix(cs_primary_data)
plt.show()
print('long data')
corr = long_data.corr(numeric_only=True)
sn.heatmap(corr, annot = True)
plt.show()
scatter_matrix(long_data)
plt.show()
Cleaned Cross-Sectional regression analysis:
The variables used in this analysis were again chosen based on the strength of their correlation in the heatmap and the correlation matrix.
Summary of the Analysis:
nWBV vs MMSE:
R2 = 0.988
P>|t| = 0.00
SES, EDUC vs MMSE:
R2 = 0.965
P>|t| = 0.00
Age vs MMSE:
R2 = 0.944
P>|t| = 0.00
# Cross regression results after cleaning
cs_nonan = cs_primary_data.dropna()
X = cs_nonan['nWBV']
y = cs_nonan['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
cs_nonan = cs_primary_data.dropna()
X = cs_nonan[['SES','Educ']]
y = cs_nonan['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
cs_nonan = cs_primary_data.dropna()
X = cs_nonan["Age"]
y = cs_nonan["MMSE"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
Cleaned Longitudinal regression analysis:
Summary of the Analysis:
nWBV vs MMSE:
R2 = 0.984
P>|t| = 0.00
This R2 is 0.02 lower than in the uncleaned analysis. This may be because cleaning the data removed entries that contributed to the fit.
SES, EDUC vs MMSE:
R2 = 0.975
P>|t| = 0.00
The R2 increased by 0.1, which could mean that some of the removed data was negatively affecting the SES, EDUC, and/or MMSE columns.
Age vs MMSE:
R2 = 0.974
P>|t| = 0.00
Again, the R2 increased (this time by 0.03), which could mean the removed data was weakening the relationship between these variables.
Converted analysis
For the Converted target, the R2 of every regression was below 0.2, meaning the independent variables explain little of the variance in this dependent variable on their own. All of them did have P>|t| = 0.00, however, so the relationships are nonetheless statistically significant.
Nondemented analysis
Most of the R2 values were slightly above 0.5, meaning these variables account for more of the variance in being nondemented.
Demented analysis
For the demented analysis, the R2 values ranged between 0.3 and 0.4, showing that roughly a third of the observed variation is explained by those inputs. The P>|t| values were also all 0.00. (A logistic-regression alternative for these binary targets is sketched after the regression cells below.)
# nWBV vs MMSE (cleaned longitudinal data)
long_nonan = long_data.dropna()
X = long_nonan['nWBV']
y = long_nonan['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
# SES, EDUC vs MMSE
X = long_nonan[['SES','EDUC']]
y = long_nonan['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
# Age vs MMSE
X = long_nonan['Age']
y = long_nonan['MMSE']
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
X = long_nonan[["EDUC","SES"]]
y = long_nonan["Converted"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
X = long_data["Age"]
y = long_data["Converted"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
X = long_data["nWBV"]
y = long_data["Nondemented"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
X = long_nonan[["EDUC",'SES']]
y = long_nonan["Nondemented"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
X = long_data["nWBV"]
y = long_data["Demented"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
X = long_nonan[["EDUC",'SES']]
y = long_nonan["Demented"]
model = sm.OLS(y,X).fit()
predictions = model.predict(X)
model.summary()
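As a follow-up to the regressions above: Converted, Nondemented, and Demented are binary 0/1 indicators, so ordinary least squares is a fairly blunt instrument for them, and a logistic regression is a natural alternative. Below is a minimal, illustrative sketch using statsmodels' Logit (the choice of nWBV as the single predictor is just an example, not part of the analysis above):
# Illustrative logistic regression of the Demented indicator on nWBV
logit_df = long_data.dropna()
logit_X = sm.add_constant(logit_df['nWBV'])
logit_y = logit_df['Demented'].astype(int)
logit_model = sm.Logit(logit_y, logit_X).fit()
logit_model.summary()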
We want to apply machine learning to the problem of predicting dementia and Alzheimer's. Machine learning is an umbrella term for algorithms that become better at performing tasks by being given data, effectively 'learning' through experience. Our goal in using machine learning is to predict measures related to the presence of dementia and Alzheimer's using some of the other features of our dataset.
We will start with a relatively simple technique in machine learning, linear regression. Linear regression optimizes an equation that takes in some predictor features and outputs a predicted value of a target feature. The linear regression algorithm is fed training data that consists of the predictor variables and the target variable. Using this data, the algorithm attempts to minimize loss. Loss is the difference (measured in different ways) between predicted values and the actual values. The end result of this process is a set of coefficients that are paired with the predictor variables in order to predict target values for new sample data that the algorithm did not see.
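To make the idea of minimizing loss concrete, here is a minimal sketch on synthetic data (all names and numbers below are illustrative): ordinary least squares picks the coefficients that minimize the sum of squared differences between predicted and actual values, which we can compute directly with NumPy's least-squares solver.
# Tiny synthetic illustration of least-squares loss minimization
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=50)
y_demo = 3.0 * x_demo + 2.0 + rng.normal(0, 1, size=50)   # true slope 3, intercept 2
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])  # intercept column + predictor
# Closed-form least-squares solution: the coefficients that minimize squared loss
coefs, _, _, _ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
predictions_demo = X_demo @ coefs
squared_loss = np.sum((y_demo - predictions_demo) ** 2)
print("fitted coefficients:", coefs)
print("sum of squared residuals:", squared_loss)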
Those are the basics of how linear regression works, but we do not have to implement it from scratch: we can use existing implementations from scikit-learn. Below we use scikit-learn's linear regression to create models that use age, education level, socioeconomic status, eTIV, nWBV, and ASF to predict MMSE and CDR (one model for each of the two target features) from the cross-sectional data.
# Import necessary modules
from sklearn.linear_model import LinearRegression
import seaborn as sns
# For linear regression we will drop rows with NaN values
cs_primary_data = cs_primary_data.dropna()
# Isolate the features that will be used as predictors
cs_input = [
cs_primary_data['Age'],
cs_primary_data['Educ'],cs_primary_data['SES'],
cs_primary_data['eTIV'],cs_primary_data['nWBV'],
cs_primary_data['ASF']]
cs_input = np.vstack(cs_input)
cs_input = cs_input.transpose()
# Fit a model for predicting MMSE
mmse_model = LinearRegression().fit(cs_input, cs_primary_data['MMSE'])
# Fit a model for predicting CDR
cdr_model = LinearRegression().fit(cs_input, cs_primary_data['CDR'])
# Calculate the residual for each row in the dataframe
for index, row in cs_primary_data.iterrows():
    cs_primary_data.loc[index, "MMSE_residual"] = row["MMSE"] \
        - mmse_model.predict(np.array([
            cs_primary_data['Age'][index],cs_primary_data['Educ'][index],
            cs_primary_data['SES'][index],cs_primary_data['eTIV'][index],
            cs_primary_data['nWBV'][index],cs_primary_data['ASF'][index],
        ]).reshape(1, -1))
    cs_primary_data.loc[index, "CDR_residual"] = row["CDR"] \
        - cdr_model.predict(np.array([
            cs_primary_data['Age'][index],cs_primary_data['Educ'][index],
            cs_primary_data['SES'][index],cs_primary_data['eTIV'][index],
            cs_primary_data['nWBV'][index],cs_primary_data['ASF'][index],
        ]).reshape(1, -1))
# Make a violin plot of mmse model residuals vs. SES
cs_mmse_violin_plot = sns.violinplot(x=cs_primary_data['SES'],
y=cs_primary_data['MMSE_residual'])
cs_mmse_violin_plot.set(xlabel='Socioeconomic status (Hollingshead Index)',
ylabel='MMSE model residual',
title='MMSE Model Residuals with SES (CS)')
Above is a graph of the residuals from the MMSE linear regression model. It shows how large the loss can be between actual MMSE and predicted MMSE (separated in the graph by SES for visibility) for the cross-sectional data. The average residuals are close to zero because linear regression attempts to minimize loss, but there are more outlier residuals than we would like if we wanted the model to accurately predict MMSE. This indicates that the relationship between our predictors and MMSE is more complicated than what linear regression can capture (at least without more careful tuning of the model).
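As an aside, the per-row loop above computes each residual individually; the same residuals can be obtained in a single vectorized call, which also makes it easy to summarize them (a minimal sketch reusing the cs_input matrix and the fitted mmse_model from above):
# Vectorized residuals and quick summary statistics for the MMSE model
mmse_residuals = cs_primary_data['MMSE'].to_numpy() - mmse_model.predict(cs_input)
print("mean residual:", mmse_residuals.mean())
print("root mean squared error:", np.sqrt((mmse_residuals ** 2).mean()))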
Below is a similar graph of residuals for the CDR linear regression model.
# Make a violin plot of cdr model residuals vs. SES
cs_cdr_violin_plot = sns.violinplot(x=cs_primary_data['SES'],
y=cs_primary_data['CDR_residual'])
cs_cdr_violin_plot.set(xlabel='Socioeconomic status (Hollingshead Index)',
ylabel='CDR model residual',
title='CDR Model Residuals with SES (CS)')
The outlier residuals for the CDR model are also larger than we would have desired, so we know that CDR is not easily predicted by our predictor variables (just like MMSE).
The above models and graphs were for the cross-sectional data. We also want to check how well linear regression predicts MMSE and CDR for the longitudinal data. Below we train those models and plot the residuals for each.
# Import necessary modules
from sklearn.linear_model import LinearRegression
import seaborn as sns
# For linear regression we will drop rows with NaN values
long_data = long_data.dropna()
# Isolate the features that will be used as predictors
long_input = [
long_data['Age'],
long_data['EDUC'],long_data['SES'],
long_data['eTIV'],long_data['nWBV'],
long_data['ASF']]
long_input = np.vstack(long_input)
long_input = long_input.transpose()
# Fit a model for predicting MMSE
mmse_model = LinearRegression().fit(long_input, long_data['MMSE'])
# Fit a model for predicting CDR
cdr_model = LinearRegression().fit(long_input, long_data['CDR'])
# Calculate the residual for each row in the dataframe
for index, row in long_data.iterrows():
    long_data.loc[index, "MMSE_residual"] = row["MMSE"] \
        - mmse_model.predict(np.array([
            long_data['Age'][index],long_data['EDUC'][index],
            long_data['SES'][index],long_data['eTIV'][index],
            long_data['nWBV'][index],long_data['ASF'][index],
        ]).reshape(1, -1))
    long_data.loc[index, "CDR_residual"] = row["CDR"] \
        - cdr_model.predict(np.array([
            long_data['Age'][index],long_data['EDUC'][index],
            long_data['SES'][index],long_data['eTIV'][index],
            long_data['nWBV'][index],long_data['ASF'][index],
        ]).reshape(1, -1))
# Make a violin plot of model residuals vs. SES
long_mmse_violin_plot = sns.violinplot(x=long_data['SES'],
y=long_data['MMSE_residual'])
long_mmse_violin_plot.set(xlabel='Socioeconomic status (Hollingshead Index)',
ylabel='MMSE model residual',
title='MMSE Model Residuals with SES (Long)')
# Make a violin plot of model residuals vs. SES
long_mmse_violin_plot = sns.violinplot(x=long_data['SES'],
y=long_data['CDR_residual'])
long_mmse_violin_plot.set(xlabel='Socioeconomic status (Hollingshead Index)',
ylabel='CDR model residual',
title='CDR Model Residuals with SES (Long)')
Each of the above residual plots is similar to the residual plots from the cross-sectional data (most residuals are close to zero, but significant outliers exist). This suggests that linear regression is most likely not the best method for our use case.
Instead we will try using a decision tree.
In general, a decision tree consists of multiple paths originating from a single point. These paths are made of nodes. At each node, a decision is made that determines what node you travel to next. In machine learning, decision trees' "decisions" are splits on variables that divide whatever inputs are fed to it. In a classification task, a decision tree splits the input data continuously in order to predict categories that a sample falls under. Regression tasks are similar, except the tree predicts a value for the target feature. Much like linear regression models, the algorithm used to make decision trees seeks to minimize loss between the predicted values and actual values for the samples in its training set.
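To make a single "decision" concrete, the sketch below fits a depth-1 regression tree (a stump) on synthetic data and prints the one split threshold it learned and the prediction for each side of the split (the data and names here are illustrative only):
from sklearn.tree import DecisionTreeRegressor
# A one-split ("stump") regression tree on a simple step function
x_toy = np.arange(20).reshape(-1, 1)
y_toy = np.where(x_toy.ravel() < 10, 1.0, 5.0)
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x_toy, y_toy)
print("split threshold:", stump.tree_.threshold[0])
print("leaf predictions:", stump.tree_.value.ravel()[1:])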
# Import necessary modules
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree
# Isolate the features that will be used as predictors from the
# longitudinal data. Education is not included this time because
# the education features use different scales in each dataset and
# we want to apply the decision trees to both data sets.
long_input = [
long_data['Age'],
long_data['SES'],
long_data['eTIV'],long_data['nWBV'],
long_data['ASF']]
long_input = np.vstack(long_input)
long_input = long_input.transpose()
# Isolate the features that will be used as predictors from the
# cross-sectional data.
cs_input = [
cs_primary_data['Age'],
cs_primary_data['SES'],
cs_primary_data['eTIV'],cs_primary_data['nWBV'],
cs_primary_data['ASF']]
cs_input = np.vstack(cs_input)
cs_input = cs_input.transpose()
# Make Decision Tree Regressor on long data for MMSE
long_clf_mmse = DecisionTreeRegressor(random_state=0,max_depth=10)
long_clf_mmse.fit(long_input, long_data['MMSE'])
# Print scores and plot the top of decision tree
print("Longitudinal MMSE classifier on longitudinal data R^2 Score: " +
str(long_clf_mmse.score(long_input, long_data['MMSE'])))
print("Longitudinal MMSE classifier on cross-sectional data R^2 Score: " +
str(long_clf_mmse.score(cs_input, cs_primary_data['MMSE'])))
print("Longitudinal MMSE Decision Tree")
fig, ax = plt.subplots(figsize=(30, 12))
tree.plot_tree(long_clf_mmse,fontsize=12,feature_names=['Age','SES','eTIV','nWBV','ASF'],max_depth=2)
plt.show()
# Make Decision Tree Regressor on cross-sectional data for MMSE
cs_clf_mmse = DecisionTreeRegressor(random_state=0,max_depth=10)
cs_clf_mmse.fit(cs_input, cs_primary_data['MMSE'])
# Print scores and plot the top of decision tree
print("Cross-sectional MMSE classifier on cross-sectional data R^2 Score: " +
str(cs_clf_mmse.score(cs_input, cs_primary_data['MMSE'])))
print("Cross-sectional MMSE classifier on longitudinal data R^2 Score: " +
str(cs_clf_mmse.score(long_input, long_data['MMSE'])))
print("Cross-sectional MMSE Decision Tree")
fig, ax = plt.subplots(figsize=(30, 12))
tree.plot_tree(cs_clf_mmse,fontsize=12,feature_names=['Age','SES','eTIV','nWBV','ASF'],max_depth=2)
plt.show()
# Make Decision Tree Regressor on long data for CDR
long_clf_cdr = DecisionTreeRegressor(random_state=0,max_depth=10)
long_clf_cdr.fit(long_input, long_data['CDR'])
# Print scores and plot the top of decision tree
print("Longitudinal CDR classifier on longitudinal data R^2 Score: " +
str(long_clf_cdr.score(long_input, long_data['CDR'])))
print("Longitudinal CDR classifier on cross-sectional data R^2 Score: " +
str(long_clf_cdr.score(cs_input, cs_primary_data['CDR'])))
print("Longitudinal CDR Decision Tree")
fig, ax = plt.subplots(figsize=(30, 12))
tree.plot_tree(long_clf_cdr,fontsize=12,feature_names=['Age','SES','eTIV','nWBV','ASF'],max_depth=2)
plt.show()
# Make Decision Tree Regressor on cross-sectional data for CDR
cs_clf_cdr = DecisionTreeRegressor(random_state=0,max_depth=10)
cs_clf_cdr.fit(cs_input, cs_primary_data['CDR'])
# Print scores and plot the top of decision tree
print("Cross-sectional CDR classifier on cross-sectional data R^2 Score: " +
str(cs_clf_cdr.score(cs_input, cs_primary_data['CDR'])))
print("Cross-sectional CDR classifier on longitudinal data R^2 Score: " +
str(cs_clf_cdr.score(long_input, long_data['CDR'])))
print("Cross-sectional CDR Decision Tree")
fig, ax = plt.subplots(figsize=(30, 12))
tree.plot_tree(cs_clf_cdr,fontsize=12,feature_names=['Age','SES','eTIV','nWBV','ASF'],max_depth=2)
plt.show()
Above we build decision trees from both the longitudinal and cross-sectional datasets for predicting both MMSE and CDR. Each tree was trained on one of the datasets and then evaluated both on the set it was trained on and on the dataset it was not trained on. Naturally, we would expect the R2 on the training data to be very high and the R2 on the other dataset to be lower. The R2 on the training data is > 0.9 for each tree, so the decision trees accurately predict the dataset they were trained on. Unfortunately, the R2 on the other dataset is extremely low for each tree (< 0 for three of the four). This tells us that our decision trees are overfitted to the training data and are not good predictors of MMSE and CDR in a general sense. The trees all have a max depth of 10, which is responsible both for the high accuracy on the training set and for the overfitting. Unfortunately, lowering the max depth did not significantly improve performance on the held-out dataset (though it did lower performance on the training data).
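As a follow-up, the cross_val_score helper imported above (but not used in the cells) gives a less optimistic estimate of how a tree generalizes by scoring it on held-out folds of a single dataset. A minimal sketch on the longitudinal data (the max_depth value here is just an illustrative choice):
# 5-fold cross-validated R^2 for a shallower MMSE tree on the longitudinal data
cv_tree = DecisionTreeRegressor(random_state=0, max_depth=3)
cv_scores = cross_val_score(cv_tree, long_input, long_data['MMSE'], cv=5)
print("per-fold R^2 scores:", cv_scores)
print("mean R^2:", cv_scores.mean())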
The problem of predicting Dementia and Alzheimer's is a complicated one. It certainly cannot be cracked in a single tutorial. However, we have learned some important and interesting insights by analyzing MRI data of individuals with and without dementia. We believe these insights (although not groundbreaking) are related to deeper patterns that could be used to predict dementia and Alzheimer's, hopefully leading to proactive therapy that can improve quality of life for individuals with these conditions.
Our MRI data has features directly tied to dementia and Alzheimer's. The Mini-Mental State Examination (MMSE) and the Clinical Dementia Rating (CDR) are measurements often used to quantify the severity of dementia and Alzheimer's, so we used these as our target features. Through exploratory data analysis we discovered several features that correlated with MMSE and/or CDR: socioeconomic status, education, nWBV, eTIV, ASF, and age.
Then we applied machine learning to test whether the relationship between our predictors and targets could be used to accurately predict the targets. Using linear regression, we were able to fit the data reasonably well (apart from an unfortunate number of outliers). This showed that the relationship between our predictors and targets was usable in machine learning, but also that linear regression on its own might not be enough to properly predict MMSE and CDR. So we tried decision trees. Unfortunately, we ran into overfitting issues that were not easily resolved. It appears that more data would be required to better predict our targets.
One of our most surprising insights from analyzing the MRI data was how education and socioeconomic status have a relationship with CDR and MMSE. This relationship needs to be further researched to be understood. Are those with poorer (or greater) socioeconomic statuses more likely to have dementia or Alzheimer's later in life? Are the less (or more) educated likely to have more severe dementia symptoms? Our tutorial may not have answered these questions, but we were able to reach these questions due to our analysis.
If you want to do more with data science, or the problem of predicting Dementia and Alzheimer's, please see the following links:
Statistical Analysis
Machine Learning
Linear Regression
Decision Trees