I found that my performance on the leaderboard's test data is much lower than my performance on the training data.
Is that because the test data is much more challenging than the training data?
I just want to find out whether the difference is due to the data itself or to my code.
By the way, I was doing 10-fold cross-validation on the training data, and the weighted Brier score looks good, but once I submitted to the leaderboard, the weighted Brier score became much worse (it increased by 0.08 to 0.09).
Has anyone else had the same problem?
Or is your performance on the training and test data quite close?

My LB result is always much worse than my CV results, and not by a consistent amount. Sometimes the LB score gets worse even when my CV results improve, and I haven’t figured out why.
It might be overfitting, or it may just be that the training and test data are so different.

One thing to keep in mind when cross-validating is that, for whatever reason, the test data has fewer observations per file than the training data. Training files have many minutes of observations, while test files have 5 to 30 seconds. Depending on what you’re doing, this may have an effect. You probably want to split each training file into small blocks to better simulate the test files.
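
Here is a rough sketch of what splitting into blocks could look like, in plain Python. The file names, the toy data, and the block length of 100 observations are all made up for illustration; you would pick a block length that matches the test clips. Keeping all blocks from one file in the same fold is my own assumption rather than something stated above: neighboring blocks from the same recording are probably correlated, so mixing them across folds could make CV look better than it should.

```python
import random

def split_into_blocks(samples, block_len):
    """Cut one recording into consecutive blocks of block_len observations.
    A trailing remainder shorter than block_len is dropped."""
    n_blocks = len(samples) // block_len
    return [samples[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]

def grouped_folds(file_ids, n_folds, seed=0):
    """Assign whole files to folds, so blocks cut from the same recording
    never land on both the training and the validation side."""
    unique_ids = sorted(set(file_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    fold_of_file = {fid: i % n_folds for i, fid in enumerate(unique_ids)}
    return [fold_of_file[fid] for fid in file_ids]

# Hypothetical toy data: 4 recordings of 600 observations each.
recordings = {f"file_{k}": list(range(600)) for k in range(4)}
block_len = 100  # assumption: choose this to match the test clip length

blocks, block_file_ids = [], []
for fid, samples in recordings.items():
    for block in split_into_blocks(samples, block_len):
        blocks.append(block)
        block_file_ids.append(fid)

# One fold index per block; blocks from the same file share a fold.
folds = grouped_folds(block_file_ids, n_folds=2)
print(len(blocks), "blocks across", len(set(folds)), "folds")
```

You would then compute features per block instead of per file and run your usual cross-validation over the fold indices.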

The way these websites work is that there is a public leaderboard and a private leaderboard.

You send your predictions. Part of them is scored for the public leaderboard, and part of them is scored for the private leaderboard. The private leaderboard is only revealed at the end. This is done to discourage overfitting.
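
To make the public/private idea concrete, here is a small simulation. Everything in it is an assumption for illustration: the 30% public share, the synthetic predictions, and the plain unweighted Brier score (the weighting this competition uses is not reproduced here). It only shows that the same submission gets two separate scores on disjoint subsets of the test set.

```python
import random

def brier_score(probs, outcomes):
    """Plain (unweighted) Brier score: mean squared difference between
    predicted probabilities and 0/1 outcomes. Lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

rng = random.Random(0)
n = 1000

# Hypothetical ground truth and noisy-but-informative predictions.
outcomes = [rng.randint(0, 1) for _ in range(n)]
probs = [min(1.0, max(0.0, 0.7 * y + 0.15 + rng.uniform(-0.15, 0.15)))
         for y in outcomes]

# Assumption: the organizers fix a hidden split, here 30% public.
indices = list(range(n))
rng.shuffle(indices)
n_public = n * 3 // 10
public_idx, private_idx = indices[:n_public], indices[n_public:]

public_score = brier_score([probs[i] for i in public_idx],
                           [outcomes[i] for i in public_idx])
private_score = brier_score([probs[i] for i in private_idx],
                            [outcomes[i] for i in private_idx])

print("public LB score: ", round(public_score, 4))
print("private LB score:", round(private_score, 4))
```

Since the public score comes from only part of the test set, tuning a model against it can fit the quirks of that subset rather than improve the final private score.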