Question

21. A student on attachment is preparing a dataset for training a linear regression model in scikit-learn. During exploratory data analysis, he detected multiple feature columns that have missing values. About 15% of the data across the whole training dataset is missing. The student is worried that this might bias his model and lead to inaccurate results. Which approach will MOST likely yield the best result in reducing the bias caused by missing values?

a. Compute the mean of non-missing values in the same column and use the result to replace missing values.
b. Use supervised learning methods to estimate the missing values for each feature.
c. Compute the mean of non-missing values in the same row and use the result to replace missing values.
d. Drop the columns that include missing values because they only account for 10% of the training data.

Solution

The best approach here is to use supervised learning methods to estimate the missing values for each feature. A model-based estimate draws on the relationships between features, so it is less prone to bias than replacing missing values with the column mean. Mean imputation can introduce bias if the data are not missing completely at random, and it also shrinks the variance of the imputed column. The row mean is not a meaningful substitute, because it averages across different features that may be on different scales. Dropping the columns with missing values is not advisable either, as it can discard valuable information, especially when several feature columns are affected and about 15% of the data is missing.
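The difference between column-mean imputation and a supervised estimate can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not part of the original question: the data is synthetic, there is a single hypothetical predictor `x`, and a hand-written least-squares line stands in for whichever supervised model one would actually choose. The missing values are deliberately not missing completely at random (the 15% of rows with the largest `x` lose their `y`), which is the situation where mean imputation is expected to do worst.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_mean(ys):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in ys if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in ys]


def impute_regression(xs, ys):
    """Fit a line on complete rows, then predict the missing entries from x."""
    pairs = [(a, b) for a, b in zip(xs, ys) if b is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [slope * a + intercept if b is None else b for a, b in zip(xs, ys)]


def mean_abs_error(truth, filled, idx):
    """Average absolute error of the filled values at the given positions."""
    return sum(abs(truth[i] - filled[i]) for i in idx) / len(idx)


# Synthetic data (assumed for illustration): y is linear in x plus Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y_true = [3.0 * a + 2.0 + rng.gauss(0, 1) for a in x]

# Hide y in the 30 rows (15%) with the largest x, so missingness depends on x.
order = sorted(range(len(x)), key=lambda i: x[i])
missing = set(order[-30:])
y_obs = [None if i in missing else v for i, v in enumerate(y_true)]

y_mean = impute_mean(y_obs)
y_reg = impute_regression(x, y_obs)

err_mean = mean_abs_error(y_true, y_mean, missing)
err_reg = mean_abs_error(y_true, y_reg, missing)
print(f"column-mean imputation error: {err_mean:.2f}")
print(f"regression imputation error:  {err_reg:.2f}")
```

In a scikit-learn workflow, the `sklearn.impute` module offers `KNNImputer` and the experimental `IterativeImputer` for model-based imputation, so a hand-rolled version like this would rarely be needed; the sketch only aims to show why an estimate that uses other features can track the true values more closely than a single column mean.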

Similar Questions

18. A data scientist within an insurance company is training a model to predict the probability of claims on a motor insurance book. The training dataset has 5000 samples. One of the variables in the training data is the location. The experts in the company have advised the data scientist that location is an important variable in increasing or decreasing the chances of a claim. Upon analyzing the data, the data scientist observed that there are 550 samples where the location is missing. Which of the following can the data scientist do to deal with the problem he has observed?

a. Drop the location column since it has more than 10% missing values
b. Drop all the rows with missing values
c. Impute missing values using the most frequent location
d. Use a KNN imputer

(b) Discuss why it is necessary to handle missing values. Use a Python program to identify the variables with missing values in the given dataset. [No more than 200 words (including in-text citations, excluding Python code)] (15 marks)

Which of the following is NOT a recommended way of dealing with missing values?

a. Use a model that predicts the missing value from the other fields
b. Remove the whole column if there are missing values in some row of that column
c. Put a NULL where there is a missing value
d. Remove the whole row if there are missing values in some column of that row

How can we deal with missing data? Please select all that apply.

a. Using other questions as a guide to arrive at an answer.
b. Replacing the missing value with a value from a different respondent.
c. Using the mean of a subsample of similar respondents.
d. Using the mean of the entire sample.

Which data pre-processing technique is commonly used to handle missing data in a dataset?

a. Feature scaling
b. Outlier detection
c. Imputation
d. Principal Component Analysis (PCA)
