Machine learning

Hello there,

I ran a machine learning analysis using the k-nearest neighbors classification algorithm to differentiate between two groups (patients and controls) based on six variables. The problem (though maybe it isn't one) is that different runs yielded different results, so it's not clear to me how to report them. In classical statistics I would run a logistic regression, and the results (based on the same dataset) would be consistent, but this is not the case in machine learning.

It would be great if anyone could help me with this question about reporting a range of machine learning results. I need to report these results in a scientific paper, so any guidelines or reporting rules in this regard would be helpful.

Many thanks!

Comments

  • How different are the results? Most of ML is based on resampling algorithms, so results are expected to differ -- but if they differ by a lot, then this would be surprising to me.
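
    For illustration (a minimal scikit-learn sketch on synthetic stand-in data, not JASP itself): repeated random train/test splits produce a spread of kNN accuracies on the very same data.

    ```python
    # Minimal scikit-learn sketch (not JASP): each run draws a fresh random
    # train/test split, so the kNN test accuracy varies from run to run.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic stand-in for the patients-vs-controls data: 2 classes, 6 variables.
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)

    for run in range(5):
        # No fixed random_state here, so the split differs on every run.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"run {run + 1}: accuracy = {acc:.2f}")
    ```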

    E.J.

  • The range is from 76% to 90% correct prediction (patient vs. control). That feels like a wide range of results to me. Isn't it?

  • Each time you run the analysis, it randomly selects a training (plus validation) and test set to use, so it is expected that the results will differ across runs. You can disable this behavior by fixing the seed in the Training parameters section. This will enable you to compare results for the same data set each time the analysis runs.
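
    As a rough illustration (again plain scikit-learn on synthetic data, not JASP's own interface), fixing the seed of the split makes the result identical on every run:

    ```python
    # Sketch: with a fixed random_state (the analogue of JASP's seed field),
    # the train/test split, and hence the accuracy, is the same every run.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, n_features=6, random_state=0)

    for run in range(3):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
        acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"run {run + 1}: accuracy = {acc:.2f}")  # same value each time
    ```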

  • Thank you for your response.

    I've seen that a fixed seed isn't recommended in ML. But let's say I run the ML analysis with a fixed seed. Am I supposed to report the entire range of different results?

    Since different runs yield different results, it seems like I'll have endless results. My point is that I'm not sure when it's recommended to stop re-running the algorithm, and which runs' results to report. Is it okay to report only the run with the best results?

    Thanks!

  • I’m not sure that I follow. If you fix the seed then the results should be the same every time you run the analysis :)

  • Okay. Got it. Thank you.

  • Yes, but Koen, isn't it worrying that the results would differ a lot depending on the seed?

  • I would say this partially depends on whether parameters of the algorithm are optimized under "Training parameters". In the k-nearest neighbors algorithm, the model is fit on the training set and the optimal number of neighbors is then chosen on the validation set. Hence, this optimization depends on the specification of both the training set and the validation set. Because of this, the results on the test set may differ between runs of the analysis, depending on how representative the training and validation sets are. The results will probably differ much less when you compare results across different training and test sets but keep the number of neighbors fixed under "Training parameters". There is a sketch of the difference below. Let me know if this helps.
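
    As a hedged illustration (plain scikit-learn on synthetic data; the roughly 60/20/20 split and the 1-20 grid for k are my own choices, not JASP's exact defaults), this compares test accuracy across seeds when k is optimized on a validation set versus held fixed:

    ```python
    # Sketch: test accuracy across seeds, with k either tuned on a
    # validation set or held fixed at k = 5.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)

    def test_accuracy(seed, fixed_k=None):
        # Reshuffle train / validation / test per seed (roughly 60/20/20).
        X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
        X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)
        if fixed_k is None:
            # "Optimized": pick the k with the best validation accuracy.
            k = max(range(1, 21),
                    key=lambda n: KNeighborsClassifier(n).fit(X_tr, y_tr).score(X_val, y_val))
        else:
            k = fixed_k
        return KNeighborsClassifier(k).fit(X_tr, y_tr).score(X_te, y_te)

    for seed in range(5):
        print(f"seed {seed}: optimized k -> {test_accuracy(seed):.2f}, "
              f"fixed k = 5 -> {test_accuracy(seed, fixed_k=5):.2f}")
    ```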
