Interest in developing AI models on medical images as decision support systems has grown sharply in the last few years. However, most published studies have not tested their models' robustness against dataset-related biases and imbalanced variables. For example, TP53 mutations are more prevalent in Estrogen Receptor (ER)-negative breast cancer, whereas most ER-positive tumors do not carry a TP53 mutation; nevertheless, published models have been developed on the entirety of the available data without testing for such intrinsic biases, which can lead to overfitting. In this study we show that models trained for TP53 mutation prediction overfit on ER status, and that stratifying training by ER status benefits all subgroups while reducing bias and increasing generalizability and fairness. (Implementation: https://github.com/tsikup/er-stratified-training-tp53-prediction).
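
One way to check whether a TP53 predictor leans on ER status is to compare its AUROC on the whole cohort with its AUROC inside each ER subgroup. The sketch below is illustrative only and is not taken from the linked repository: the helper names `auc` and `auc_by_stratum`, the toy labels, and the hypothetical "shortcut" model that scores tumors by ER status alone are all assumptions made for the example.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUROC by pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_stratum(labels, scores, strata):
    """Compute AUROC separately within each stratum (here: ER status)."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, strata):
        groups[g][0].append(y)
        groups[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in groups.items()}


# Toy cohort, invented for illustration (not data from the study):
# TP53 mutations (1) are enriched in ER-negative tumors.
er = ["neg"] * 4 + ["pos"] * 4
tp53 = [1, 1, 1, 0, 0, 0, 0, 1]

# Hypothetical shortcut model: the score depends on ER status only.
shortcut_scores = [0.9 if e == "neg" else 0.1 for e in er]

overall = auc(tp53, shortcut_scores)
per_stratum = auc_by_stratum(tp53, shortcut_scores, er)

print("overall AUROC:", overall)          # above chance on the pooled cohort
print("per-ER-stratum AUROC:", per_stratum)  # chance level within each subgroup
```

On this toy cohort the shortcut model looks informative when ER subgroups are pooled but is no better than chance within either subgroup, which is the pattern a per-stratum evaluation is meant to expose.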