Posted: January 24th, 2023
Assignment 2: Machine Learning Model Training 1
• Two multi-part, multiple-choice questions.
• AI in Healthcare with Phase 2 data set (HTML file)
• Details of the Q1 & Q2 multiple-choice questions are shown in the attached question files.
• Lecture notes on Machine Learning in Healthcare for your reference
Phase II Model Training Question Sheet
Click on the HTML file attached to read the scenario. After reading through the case, please review the two questions in this assignment.
Keep the HTML file open so that it is easier to look up the information the questions ask about. Check all the correct answers and explain your choices in 2 or 3 sentences.
Q 1 Part 1.
The team split the data into two partitions: the training set and the test set. It is considered best practice to have a third partition: the validation set. What added utility is there in having a validation set? Check all that apply.
The validation set can be used for tuning hyperparameters
The validation set can be used for early stopping
The test set should be used for final evaluation only
The validation set can be used for updating the model directly
Part 2.
The team split the data randomly, without accounting for the patient to whom each exam belongs. Why would this be a problem? Recall: “The COVID dataset consists of 30,000 exams across 21,000 patients (some patients may be associated with multiple exams)”
Patient overlap between the training and test sets may lead to problems with model convergence due to exposure to the test set
Patient overlap between the training and test sets may lead to problems with model bias because of the underrepresentation of certain patient demographics in the training set
Patient overlap between the training and test sets may lead to the leakage of PHI or other sensitive data
Patient overlap between the training and test sets may lead to inflated model performance due to unrealistic evaluation conditions
Part 3.
The team downsized the images to 224 by 224 pixels. Why might this lead to worse model performance?
The discriminative features in the image may be too small to identify without a higher resolution
Many publicly available models use 224 by 224 pixel images
Memory constraints may limit the model’s ability to process high-resolution images
224 by 224 pixel chest x-rays are easier to classify than 3000 by 3000 pixel chest x-rays
Part 4.
Why are Convolutional Neural Networks (CNN) particularly well suited for image classification tasks? Check all that apply.
CNN architectures take advantage of feature locality through the use of filters
CNN architectures leverage multiple decision trees in order to make their predictions more robust
CNN architectures are parameter-efficient because they use the same set of weights on each region of the image
CNN architectures can condition on previous timesteps, which they take as input in addition to the images themselves
Part 5.
What learning phenomenon is the team observing?
Convergence
Overfitting
Underfitting
Generalization
Part 6.
(i)
A colleague approaches you and suggests that it would be better if you created a model that relied only on observable features and exam metadata (patient age, gender, ethnicity, etc.). What trade-offs must be considered when using lab values as features?
Answer in 3-5 sentences.
(ii)
Before using the new public COVID dataset, you want to verify that there is no PHI in the data. What are some privacy issues that could come into play with imaging data?
Answer in 3-5 sentences.
Q 2 Part 1.
The D-DIMER values are highly concentrated below 1,000, but many samples are several orders of magnitude away from the rest. What is the most likely explanation for this? (Hint: look at the data samples, particularly the exam metadata.)
There is a large disparity in D-DIMER lab values across patient gender
There is a large disparity in D-DIMER lab values across patient age
The data collected from one of the clinics may use different units
The data was collected from two cohorts from two different time periods
Part 2.
Which of the following strategies can be used to handle the missing values in the EHR dataset? Check all that apply.
A logistic regression model can be trained after the missing values are synthetically generated, using a process known as imputation
A tree-based model, such as random forest, can be trained directly on the data with missing values
A tree-based model, such as random forest, can be trained after the missing values are synthetically generated, using a process known as imputation
A logistic regression model can be trained directly on the data with missing values
Part 3.
Which of the following is FALSE regarding logistic regression models?
Logistic regression uses the sigmoid activation function
Logistic regression can take unstructured inputs, such as images or text
Logistic regression produces values between 0 and 1, regardless of the scale of the features
Logistic regression is commonly used for classification problems
Part 4.
Which of the following is FALSE regarding random forest models?
Random forest models are a type of decision tree algorithm
Random forest models are highly interpretable
Random forest models learn multiple decision trees that each learn on a subset of the available features
Random forest models require feature normalization (i.e. scaling the features such that they are between 0 and 1) in order to work effectively
SOLUTION
Machine learning model training is the process of using a set of labeled data, called a training dataset, to learn the parameters of a model. The process typically involves feeding the training data into the model, adjusting the model’s parameters to minimize an error metric, and repeating these steps until the model’s performance is satisfactory, ideally as measured on held-out validation data rather than on the training data alone. Once the model is trained, it can be used to make predictions on new, unseen data.
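The training loop described above can be illustrated with a minimal sketch. It is not taken from the assignment or its data set: the one-feature toy data, the choice of logistic regression, and the names `train`, `log_loss`, `lr`, and `epochs` are all assumptions made for illustration. The sketch feeds labeled data into a model, adjusts the parameters by gradient descent to reduce an error metric (log loss), and then predicts on inputs that were not used for training.

```python
import math


def sigmoid(z):
    # Squash any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    # Logistic regression with a single feature x (illustrative only).
    return sigmoid(w * x + b)


def log_loss(w, b, xs, ys):
    # Average negative log-likelihood: the error metric being minimized.
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def train(xs, ys, lr=0.1, epochs=500):
    # Start from zero parameters and repeatedly step against the gradient.
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y  # gradient of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
        history.append(log_loss(w, b, xs, ys))
    return w, b, history


# Hypothetical toy data: negative inputs are class 0, positive inputs class 1.
train_x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

# Inputs the model never saw during training.
new_x = [-1.2, 1.2]

w, b, history = train(train_x, train_y)
predictions = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in new_x]

print("training loss after first and last epoch:", history[0], history[-1])
print("predictions on unseen inputs:", predictions)
```

In practice a separate validation set would be monitored alongside the training loss, so that a model that merely memorizes the training data is not mistaken for one that generalizes.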