Machine learning

Hello there,

I ran a machine learning analysis using the k-nearest neighbors classification algorithm to differentiate between two groups (patients and controls) based on six variables. The problem (though maybe it isn't one) is that different runs yielded different results, so it's not clear to me how to report them. In classical statistics I would run a logistic regression, and the results (based on the same dataset) would be consistent, but this is not the case in machine learning.

It would be great if anyone could help me with this question about reporting a range of machine learning results. I need to report these results in a scientific paper, so any guidelines or reporting rules in this regard would be helpful.

Many thanks!

Comments

  • How different are the results? Most of ML is based on resampling algorithms, so results are expected to differ -- but if they differ by a lot, then this would be surprising to me.
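
    For illustration (a minimal scikit-learn sketch on synthetic stand-in data, not JASP itself): repeated random train/test splits produce a spread of kNN accuracies on the very same data.

    ```python
    # Minimal scikit-learn sketch (not JASP): each run draws a fresh random
    # train/test split, so the kNN test accuracy varies from run to run.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic stand-in for the patients-vs-controls data: 2 classes, 6 variables.
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)

    for run in range(5):
        # No fixed random_state here, so the split differs on every run.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"run {run + 1}: accuracy = {acc:.2f}")
    ```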

    E.J.

  • The range is from 76% to 90% correct prediction (patient vs. control). That feels like a wide range of results to me. Isn't it?

  • Each time you run the analysis, it randomly selects a training (plus validation) and test set to use, so it is expected that the results will differ across runs. You can disable this behavior by fixing the seed in the Training parameters section. This will enable you to compare results for the same data set each time the analysis runs.
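
    As a rough illustration (again plain scikit-learn on synthetic data, not JASP's own interface), fixing the seed of the split makes the result identical on every run:

    ```python
    # Sketch: with a fixed random_state (the analogue of JASP's seed field),
    # the train/test split, and hence the accuracy, is the same every run.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, n_features=6, random_state=0)

    for run in range(3):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
        acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"run {run + 1}: accuracy = {acc:.2f}")  # same value each time
    ```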

  • Thank you for your response.

    I've seen that a fixed seed isn't recommended in ML. But let's say I run the ML analysis with a fixed seed. Am I supposed to report the entire range of different results?

    Since different runs yield different results, it seems like I'll have endless results. My point is that I'm not sure when it's recommended to stop re-running the algorithm, and which runs' results to report. Is it okay to report only the run with the best results?

    Thanks!

  • I’m not sure that I follow. If you fix the seed then the results should be the same every time you run the analysis :)

  • Okay. Got it. Thank you.

  • Yes, but Koen, isn't it worrying that the results would differ a lot depending on the seed?

  • I would say this partially depends on whether parameters of the algorithm are optimized under "Training parameters". In the k-nearest neighbors algorithm, the model is fit on the training set and the optimal number of neighbors is then chosen on the validation set. Hence, this optimization depends on the specification of both the training set and the validation set. Because of this, the results on the test set may differ between runs of the analysis, depending on how representative the training and validation sets are. The results will probably differ much less when you compare results across different training and test sets but keep the number of neighbors fixed under "Training parameters". There is a sketch of the difference below. Let me know if this helps.
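
    As a hedged illustration (plain scikit-learn on synthetic data; the roughly 60/20/20 split and the 1-20 grid for k are my own choices, not JASP's exact defaults), this compares test accuracy across seeds when k is optimized on a validation set versus held fixed:

    ```python
    # Sketch: test accuracy across seeds, with k either tuned on a
    # validation set or held fixed at k = 5.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)

    def test_accuracy(seed, fixed_k=None):
        # Reshuffle train / validation / test per seed (roughly 60/20/20).
        X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
        X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)
        if fixed_k is None:
            # "Optimized": pick the k with the best validation accuracy.
            k = max(range(1, 21),
                    key=lambda n: KNeighborsClassifier(n).fit(X_tr, y_tr).score(X_val, y_val))
        else:
            k = fixed_k
        return KNeighborsClassifier(k).fit(X_tr, y_tr).score(X_te, y_te)

    for seed in range(5):
        print(f"seed {seed}: optimized k -> {test_accuracy(seed):.2f}, "
              f"fixed k = 5 -> {test_accuracy(seed, fixed_k=5):.2f}")
    ```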
