There is usually latent information in the null/missing value which a tree model or deep network can single out.
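For illustration, here is a minimal sketch of that idea (the DataFrame, column names, and numbers are made up for this example): leave the NaNs in place for the tree model and add an explicit missingness indicator so the trees can split directly on "was in the class last year".

```python
# Minimal sketch: made-up DataFrame, column names are illustrative.
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

df = pd.DataFrame({
    "prior_grade": [78.0, np.nan, 91.0, np.nan, np.nan, 64.0],  # NaN = not in class last year
    "attendance":  [0.95, 0.80, 0.99, 0.70, 0.85, 0.90],
    "grade":       [81.0, 55.0, 93.0, 48.0, 72.0, 70.0],        # target
})

# Explicit missingness indicator: lets the trees split directly on
# "was this student in the class last year", which is where the latent
# signal in the NaN usually lives.
df["had_prior_grade"] = df["prior_grade"].notna().astype(int)

X = df[["prior_grade", "attendance", "had_prior_grade"]]
y = df["grade"]

# LightGBM sends NaN down its own branch at each split, so the missing
# prior grades need no imputation at all.
model = LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X, y)
```

LightGBM (and XGBoost) route missing values to a learned default branch at each split, so the 80% of students without a prior grade never need to be filled in.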
It gets a bit trickier in the peculiar case where 80% of the variable's values are missing. It might be a good idea to wait for more data, or to try imputation with a probabilistic model (maybe a Bayesian imputation model).
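If you do go the probabilistic-imputation route, something along these lines could be a starting point; this uses scikit-learn's IterativeImputer with a BayesianRidge estimator as a stand-in for a full Bayesian model, and the toy matrix is purely illustrative.

```python
# Sketch of probabilistic (multiple) imputation; toy matrix is illustrative,
# with column 0 = prior grade (NaN for most students), column 1 = attendance.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X = np.array([
    [78.0, 0.95],
    [np.nan, 0.80],
    [91.0, 0.99],
    [np.nan, 0.70],
    [np.nan, 0.85],
    [64.0, 0.90],
])

# sample_posterior=True draws each imputation from the fitted predictive
# distribution, so different seeds give multiple imputed data sets whose
# spread reflects the imputation uncertainty.
imputed_sets = [
    IterativeImputer(estimator=BayesianRidge(),
                     sample_posterior=True,
                     random_state=seed).fit_transform(X)
    for seed in range(5)
]
```

With 80% missing, though, the imputations will mostly reflect the imputation model rather than real information, which is why waiting for more data may be the honest option.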
Anyhow, being able to classify the non-passing students is probably just as important though ;)
A boosted zero-or-one-inflated beta regression might help you out here: you could classify the non-passing students and run a beta regression over the grades at the same time, and the gradient boosting would share information between the two parts as well.
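I don't have a ready-made zero-or-one-inflated beta booster in Python to point at, but here is a rough sketch of the two-part idea: a classifier for the non-passing students plus a beta-style regression on a logit scale for the rest. The data is synthetic, and the two parts are trained side by side rather than under one joint inflated-beta objective.

```python
# Rough two-part sketch of the inflated-beta idea with gradient boosting.
# Synthetic data stands in for real features/grades; grades are on a [0, 1]
# scale and the parts are fitted separately, not under one joint objective.
import numpy as np
from lightgbm import LGBMClassifier, LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.beta(5, 2, size=200)
y[rng.random(200) < 0.2] = 0.0          # inflated zeros: the non-passing students

# Part 1: classify the non-passing (zero-grade) students.
is_zero = (y == 0).astype(int)
clf = LGBMClassifier(n_estimators=200, learning_rate=0.05).fit(X, is_zero)

# Part 2: beta-style regression on the passing grades via a logit transform.
pos = y > 0
y_pos = np.clip(y[pos], 1e-4, 1 - 1e-4)
reg = LGBMRegressor(n_estimators=200, learning_rate=0.05).fit(
    X[pos], np.log(y_pos / (1 - y_pos))
)

# Combined prediction: P(pass) * (grade mapped back from the logit scale).
p_pass = 1.0 - clf.predict_proba(X)[:, 1]
expected_grade = p_pass / (1.0 + np.exp(-reg.predict(X)))
```

A proper implementation would fit both parts under a single inflated-beta likelihood so the boosting genuinely shares information between them; this sketch only approximates that.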
I probably wouldn't bother with deep learning here unless you've got a substantial amount of data. LightGBM and XGBoost tend to work just as well, if not better, on structured tabular data.
manpreet
Best Answer
2 years ago
Sometimes data sets contain variables that indicate both whether an event occurred and, when it did, the value associated with that event.
As an example, say a teacher wants to predict the grades of his students. Some of the students may have been in his class last year, and he can use that grade as a variable. However, maybe only 20% of the students were in his class, so the other 80% will have a null value. Most ML algorithms cannot accept null values, so the variable would have to be imputed somehow.
I cannot think of an imputation method that would make sense here. The standard mean/mode imputation would imply that all students were in the class, and since the variable is quite unbalanced and 80% of its values would be imputed, I don't imagine it would hold any valuable information.
Are there any methods to deal with this scenario?