2y ago

Should I split my dataset if I'm solely trying to understand feature importance?

Not OP. This question is being reposted to preserve technical content removed from elsewhere. Feel free to add your own answers/discussion.

Original question:

I've been given a dataset with several variables and a binary success metric (1 or 0) at the end. I've been asked to analyze the dataset and give insights on how to improve the success rate. To do this I intend to do a thorough exploratory analysis to study correlations and relationships. I also intend to run a logistic regression and use the feature coefficients to confirm these correlations.

My question is: if my sole interest is understanding which feature matters most for the metric, and not building a robust model, should I still split my dataset into two? What benefit do I get from splitting it? Won't my exploratory analysis lose value if I set aside, say, 20% of the data?
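
To make the question concrete, here is a minimal sketch of the comparison being asked about, on synthetic data. Everything in it is an assumption for illustration, not the real dataset: the three standardized features, their effect sizes, the 80/20 split, and the hand-rolled `fit_logistic` helper (plain gradient descent, standing in for whatever library would actually be used).

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, xi):
    # w = [intercept, w1, ..., wk]
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def fit_logistic(X, y, lr=1.0, n_iter=300):
    """Hypothetical helper: unregularized logistic regression via batch gradient descent."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def accuracy(w, X, y):
    hits = sum((predict(w, xi) >= 0.5) == (yi == 1) for xi, yi in zip(X, y))
    return hits / len(y)


random.seed(0)

# Synthetic stand-in: feature 1 helps success, feature 2 hurts it, feature 3 is noise.
true_w = [-0.5, 2.0, -0.8, 0.0]
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(800)]
y = [1 if random.random() < predict(true_w, xi) else 0 for xi in X]

split = int(0.8 * len(X))  # rows are already in random order
w_full = fit_logistic(X, y)                     # option A: use all rows
w_train = fit_logistic(X[:split], y[:split])    # option B: hold out 20%
acc_test = accuracy(w_train, X[split:], y[split:])

print("coefficients, full data:", [round(v, 2) for v in w_full])
print("coefficients, 80% train:", [round(v, 2) for v in w_train])
print("held-out accuracy:", round(acc_test, 3))
```

The held-out 20% is not used to rank features here. It only checks that the fitted model describes rows it has not seen, and the two coefficient vectors show how much the ranking moves when a fifth of the data is removed. Comparing coefficient magnitudes like this is only meaningful because the synthetic features share the same scale.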

Thank you for your help
