Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Supported by

balanced test set

In the blog whose title is "How to Train a Machine Learning Model in JASP: Classification", and according to the paragraph below, we must create a balanced test set. Can you show me how?

"To ensure we compute reliable performance statistics for our model later on, we created a balanced test set consisting of an equal number of non-churning and churning customers. To do this, we added a custom made indicator called testIndicator that represents 20% of the data that contains an equal proportion of churning to non-churning customers. The middle and right panel of Figure 1 shows the distribution of churning in the indicated equal data, and in the remainder of the data respectively."

Comments

  • I've asked the team

  • edited October 2023

    Hi may01dz,

    You cannot do this in the machine learning module directly but requires the manual inspection of the data and specification of a variable in the data set that indicates which observations belong to the test set (indicated by 1, you have manually balanced these across classes) and which observations belong to the training set (indicated by 0). You can then use this balanced test set for training the model.

    Note that a balanced test set is an ideal scenario, but is often impractical since one class may occur in the data naturally much more often (e.g., sick vs. healthy people, fraud vs non-fraud).

    Hope that clarifies a little!

  • I believe I have grasped your instructions and have executed the following steps:

    I duplicated the data contained within the 'Churn' column and pasted it into a new column which I designated as 'testIndicator.' I have then proceeded to select a representative sample comprising ten percent of customers classified as 'NO,' and an additional ten percent classified as 'YES.' I have represented these selections within the Colum as '1', whereas the remaining eighty percent of customers were represented as '0.'

    I kindly request your confirmation on the correctness of my approach. The file is attached for your reference.

  • Yes, this looks good! You now have a balanced test set and you can use the 'testIndicator' variable as the 'Test set indicator' (as shown below) to index the test set when training any of the algorithms in the machine learning module!


  • Thanks, that's what I did.

Sign In or Register to comment.