The human microbiome, consisting of diverse microbial communities, plays a crucial role in health and disease. Identifying associations between microbes and diseases holds the potential to revolutionize diagnosis, prevention, and treatment strategies. Despite the high accuracy reported by many predictive models on microbe-disease association tasks, their evaluation frameworks often overlook biases and distributional challenges inherent in the datasets, resulting in unreliable performance metrics. To overcome this limitation, we propose a novel evaluation framework that systematically adjusts for node degree distribution, ensuring a more rigorous and reliable assessment of model performance. We also present a predictive model that leverages deep artificial neural networks and graph learning algorithms and is designed to perform well under these stricter evaluation criteria for microbe-disease association prediction. A comparative analysis with six state-of-the-art models shows that our approach consistently outperforms existing methods under both the traditional and the newly proposed evaluation frameworks.

The proposed evaluation framework introduces a sampling algorithm that constructs test sets with nearly balanced numbers of positive and negative associations for each node, yielding a less biased and more comprehensive evaluation.

Our predictive model integrates heterogeneous biological network data through the MDKG knowledge graph, which encompasses nine types of biological entities, including microbes and diseases, connected by 39 distinct edge types. We apply the Node2Vec algorithm to this knowledge graph to generate rich feature representations for microbes and diseases, and these features are then processed by a deep artificial neural network designed for microbe-disease association prediction.

We evaluated our approach on the HMDAD dataset, which compiles known microbe-disease associations curated from published research articles and is the most widely used benchmark in this problem domain, ensuring comparability and relevance. Our model achieved an average AUC of 0.91 under the traditional evaluation framework, outperforming or matching existing models. Under the proposed evaluation framework, it achieved an average AUC of 0.73, surpassing the competing models, whose AUCs ranged between 0.41 and 0.67. Using the Wilcoxon signed-rank test, we show that our model significantly outperforms all other models under the new evaluation framework and outperforms five of the six models under the traditional framework, performing on par with the sixth.
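
The sampling algorithm behind the proposed framework is described above only at a high level. The following is a minimal Python sketch of one way per-node balancing could be implemented; the function name `balanced_test_negatives`, the input format, and the toy data are assumptions for illustration, not the authors' code.

```python
import random
from collections import defaultdict

def balanced_test_negatives(test_positives, all_microbes, all_diseases,
                            known_positives, seed=0):
    """For every node appearing in the test positives, draw roughly as many
    unknown (negative) pairs as that node has test positives, so per-node
    positive and negative counts stay nearly balanced."""
    rng = random.Random(seed)
    known = set(known_positives)
    negatives = set()

    # Count test positives per microbe and per disease node.
    pos_count = defaultdict(int)
    for m, d in test_positives:
        pos_count[("microbe", m)] += 1
        pos_count[("disease", d)] += 1

    # For each node, sample unknown partners until its negative count
    # (approximately) matches its positive count.
    for (kind, node), k in pos_count.items():
        partners = all_diseases if kind == "microbe" else all_microbes
        candidates = []
        for p in partners:
            pair = (node, p) if kind == "microbe" else (p, node)
            if pair not in known and pair not in negatives:
                candidates.append(pair)
        rng.shuffle(candidates)
        negatives.update(candidates[:k])

    return sorted(negatives)

# Toy usage with made-up identifiers.
microbes = ["m1", "m2", "m3"]
diseases = ["d1", "d2", "d3"]
known = {("m1", "d1"), ("m1", "d2"), ("m2", "d1"), ("m3", "d3")}
test_pos = [("m1", "d1"), ("m2", "d1")]
print(balanced_test_negatives(test_pos, microbes, diseases, known))
```

Because a negative drawn for one node also counts toward its partner node, the balance is approximate rather than exact, which matches the "nearly balanced" wording above.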
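
Feature extraction with Node2Vec could look roughly like the sketch below. It uses the community `networkx` and `node2vec` packages on a small toy graph standing in for MDKG, since the knowledge graph itself and its exact entity and edge types are not reproduced here; the node names, edge labels, hyperparameters, and the concatenation-based pairing scheme are illustrative assumptions.

```python
# pip install networkx node2vec numpy
import numpy as np
import networkx as nx
from node2vec import Node2Vec

# Toy heterogeneous graph standing in for the MDKG knowledge graph:
# nodes of several types connected by typed edges.
G = nx.Graph()
G.add_edge("microbe:Helicobacter_pylori", "disease:gastritis", type="associated_with")
G.add_edge("microbe:Helicobacter_pylori", "gene:IL8", type="interacts_with")
G.add_edge("gene:IL8", "disease:gastritis", type="implicated_in")

# Learn node embeddings with biased random walks (Node2Vec).
n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=100,
               p=1.0, q=1.0, workers=1)
model = n2v.fit(window=10, min_count=1)

def pair_features(microbe, disease):
    """Feature vector for a candidate microbe-disease pair: concatenation of
    the two node embeddings (other pairing schemes, e.g. Hadamard, also work)."""
    return np.concatenate([model.wv[microbe], model.wv[disease]])

x = pair_features("microbe:Helicobacter_pylori", "disease:gastritis")
print(x.shape)  # (128,) with 64-dimensional node embeddings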
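```

The deep artificial neural network is likewise only summarized in the text. A hedged PyTorch sketch of a feed-forward classifier over pair features, trained with binary cross-entropy and scored by AUC, might look as follows; the layer sizes, dropout rate, and synthetic data are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

class AssociationMLP(nn.Module):
    """Feed-forward classifier over concatenated microbe/disease features."""
    def __init__(self, in_dim=128, hidden=(256, 64), dropout=0.3):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, 1))  # single logit: association score
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Synthetic stand-in data: pairs with 128-dim features and binary labels.
torch.manual_seed(0)
X_train, y_train = torch.randn(200, 128), torch.randint(0, 2, (200,)).float()
X_test,  y_test  = torch.randn(50, 128),  torch.randint(0, 2, (50,)).float()

model = AssociationMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(100):  # short training loop sketch
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()

with torch.no_grad():
    scores = torch.sigmoid(model(X_test)).numpy()
print("AUC:", roc_auc_score(y_test.numpy(), scores))
```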
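
The Wilcoxon signed-rank comparison can be reproduced with SciPy once per-split AUCs for two models on the same test splits are available; the AUC values below are placeholders for illustration, not results reported in the study.

```python
from scipy.stats import wilcoxon

# Per-split AUCs for our model and one competitor on identical test splits.
# These numbers are placeholders, not figures from the paper.
auc_ours  = [0.74, 0.71, 0.75, 0.72, 0.73]
auc_other = [0.62, 0.60, 0.65, 0.63, 0.61]

# One-sided test: does our model score higher than the competitor?
stat, p_value = wilcoxon(auc_ours, auc_other, alternative="greater")
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
```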