Although deep learning-based computer-aided diagnosis systems have recently achieved expert-level performance, developing a robust model requires large, high-quality datasets with annotations that are expensive to obtain. This poses a conundrum: chest X-rays collected every year cannot be utilized because they lack labels, especially in deprived areas. In this study, we present a framework named distillation for self-supervision and self-train learning (DISTL), inspired by the learning process of radiologists, which can improve the performance of a vision transformer by applying self-supervision and self-training simultaneously through knowledge distillation. In external validation on data from three hospitals for the diagnosis of tuberculosis, pneumothorax, and COVID-19, DISTL offers gradually improved performance as the amount of unlabeled data increases, even surpassing the fully supervised model trained with the same amount of labeled data. We additionally show that the model obtained with DISTL is robust to various real-world nuisances, offering better applicability in clinical settings.