Shubhamai here. This is the full story of how I approached a Kaggle competition, how my results turned out, and what I learned from the 1st place solution once the competition ended.
Throughout April 2020, I worked on a Kaggle Research Prediction Competition named Plant Pathology 2020 - FGVC7, whose goal was to identify the category of foliar diseases in apple trees.
Put simply, it was a multi-class classification problem: detecting diseases from images of plant leaves. So, I took the challenge and started working on it.
The complete journey is covered in these two blogs here, and I also deployed the model in production.
The competition ended on 27 May 2020, and I was really excited to see the winning solution and learn what the difference was between the 1st place solution and the 616th place solution (mine😄). Let's go!
So, here are the key features of the 1st place solution:
Everything was pretty standard, and the main thing is that he used a single model, which is really awesome.
- Width shift, height shift
- Horizontal and vertical flip
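To make the augmentations listed above concrete, here is a minimal, framework-free sketch on a tiny grid standing in for an image. The helper names and the `fill` padding value are hypothetical and chosen only for illustration; in a real pipeline an image library would apply these transforms to actual pixel arrays, usually with random shift amounts.

```python
def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def vertical_flip(img):
    """Reverse the order of the rows (top becomes bottom)."""
    return [row[:] for row in img[::-1]]

def width_shift(img, dx, fill=0):
    """Shift pixels dx columns to the right (left if dx < 0), padding with `fill`."""
    w = len(img[0])
    out = []
    for row in img:
        if dx >= 0:
            out.append([fill] * min(dx, w) + row[: max(w - dx, 0)])
        else:
            out.append(row[-dx:] + [fill] * min(-dx, w))
    return out

def height_shift(img, dy, fill=0):
    """Shift pixels dy rows down (up if dy < 0), padding with `fill`."""
    h, w = len(img), len(img[0])
    blank = [[fill] * w for _ in range(min(abs(dy), h))]
    if dy >= 0:
        return blank + [row[:] for row in img[: max(h - dy, 0)]]
    return [row[:] for row in img[-dy:]] + blank

# A 2x3 toy "image" (assumed values, not real data).
img = [[1, 2, 3],
       [4, 5, 6]]
flipped = horizontal_flip(img)
shifted = width_shift(img, 1)
```

Each function returns a new grid and leaves the input untouched, so the same source image can be augmented several different ways.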
I used an ensemble rather than a single model: one model was Xception and the other was DenseNet121, and I averaged their outputs.
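Averaging the outputs of two models can be sketched in a few lines. This is only an illustration: the function name is hypothetical and the probability values are made up, not real predictions from my Xception or DenseNet121 models.

```python
def average_predictions(*model_probs):
    """Average per-class probabilities from several models, sample by sample."""
    n_models = len(model_probs)
    return [
        [sum(class_probs) / n_models for class_probs in zip(*sample)]
        for sample in zip(*model_probs)
    ]

# Made-up outputs for two images over four classes (illustrative values only).
xception_probs = [[0.70, 0.10, 0.10, 0.10], [0.20, 0.20, 0.50, 0.10]]
densenet_probs = [[0.50, 0.30, 0.10, 0.10], [0.10, 0.20, 0.60, 0.10]]

ensemble_probs = average_predictions(xception_probs, densenet_probs)
```

Because each model's row sums to 1, the averaged row also sums to 1, so the result can be used directly as the submission probabilities.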
Other than that, I also used a learning rate scheduler.
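For readers who have not used one, a learning rate scheduler is just a function from the epoch number to a learning rate. The sketch below assumes a linear warm-up followed by exponential decay, which is one common shape; it is not necessarily the schedule I used, and every default value here is an assumption for illustration.

```python
def lr_schedule(epoch, lr_start=1e-5, lr_max=1e-4, lr_min=1e-6,
                warmup_epochs=5, decay=0.8):
    """Linear warm-up from lr_start to lr_max, then exponential decay toward lr_min."""
    if epoch < warmup_epochs:
        return lr_start + (lr_max - lr_start) * epoch / warmup_epochs
    return lr_min + (lr_max - lr_min) * decay ** (epoch - warmup_epochs)

# Learning rate for each of the first 20 epochs.
rates = [lr_schedule(epoch) for epoch in range(20)]
```

The warm-up keeps early updates small while pretrained weights adapt to the new data, and the decay lets training settle later on.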
The Main Difference (Problem)
The biggest problem was that some samples in the dataset were mislabeled and the dataset was imbalanced. The 1st place solution dealt with this problem like this:
Knowledge distillation: first train 5-fold models and get out-of-fold predictions on the validation data, then mix the out-of-fold predictions and the ground truth at a 3:7 ratio to form the labels for training a new model.
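The label-mixing step described above can be sketched as follows. I am reading "3:7" as a weight of 0.3 on the out-of-fold predictions and 0.7 on the ground truth, which is my interpretation of the description; the function name and the sample values are hypothetical, and training the 5-fold models themselves is not shown.

```python
def mix_labels(oof_probs, ground_truth, oof_weight=0.3):
    """Blend out-of-fold predictions with one-hot labels (3:7 by default)."""
    gt_weight = 1.0 - oof_weight
    return [
        [oof_weight * p + gt_weight * y for p, y in zip(pred_row, label_row)]
        for pred_row, label_row in zip(oof_probs, ground_truth)
    ]

# One made-up sample where the fold models strongly disagree with the label,
# which is what a mislabeled image might look like.
oof_probs = [[0.1, 0.1, 0.7, 0.1]]
ground_truth = [[1, 0, 0, 0]]

soft_labels = mix_labels(oof_probs, ground_truth)
# soft_labels[0] is approximately [0.73, 0.03, 0.21, 0.03]
```

The new model is then trained on these soft labels, so a suspicious label still counts, but it no longer forces the model to be fully confident in it.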
And another top solution was to:
Use oversampling and class weights to deal with the data imbalance, especially for leaves with multiple diseases.
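Both ideas are easy to sketch with the standard library. The weighting below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), and the oversampling simply resamples minority classes with replacement up to the majority count. I do not know the exact recipe the top solutions used, so treat the function names, class names, and counts here as illustrative assumptions.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

def oversample(samples, labels, seed=0):
    """Resample minority classes with replacement up to the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels

# Toy dataset: image ids 0..11 with a heavily under-represented class.
labels = ["rust"] * 6 + ["scab"] * 5 + ["multiple_diseases"] * 1
samples = list(range(len(labels)))

weights = balanced_class_weights(labels)
balanced_samples, balanced_labels = oversample(samples, labels)
```

Class weights change how much each mistake costs in the loss, while oversampling changes how often the model sees each class. Oversampling should only be applied to the training split, otherwise duplicated images leak into validation.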
I will surely try these techniques in another competition with similar class imbalance or wrong labels.
Focal loss was also used by many top Kagglers in this competition. I will be sure to check it out and experiment with these new ideas in the next competition.
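As a quick note to myself on what focal loss does: it is cross-entropy scaled by a factor (1 - p_t)^gamma, where p_t is the probability the model gives to the true class, so well-classified examples contribute much less to the loss. The sketch below computes it for a single sample; the defaults `gamma=2.0` and `alpha=1.0` are assumptions for illustration, not the settings used by the top solutions.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0, eps=1e-7):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t)."""
    # Clip p_t away from 0 and 1 so the logarithm stays finite.
    p_t = min(max(probs[target], eps), 1.0 - eps)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss([0.9, 0.1], target=0)   # confident and correct
hard = focal_loss([0.1, 0.9], target=0)   # confident and wrong
cross_entropy = focal_loss([0.9, 0.1], target=0, gamma=0.0)  # gamma=0 is plain CE
```

With `gamma=0` the scaling factor disappears and the function reduces to ordinary cross-entropy, which makes it easy to compare the two on the same predictions.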
Where was I wrong?
I didn't think about how to handle the imbalanced dataset, and I also didn't explore the dataset enough to notice that there were wrong labels, let alone figure out how to cope with them.
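Two cheap checks would have surfaced both problems early. This is a rough sketch under my own assumptions: the function names and the `threshold` value are hypothetical, and the second check presumes out-of-fold predictions from some baseline model are already available.

```python
from collections import Counter

def class_distribution(labels):
    """Fraction of the dataset that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: cnt / total for cls, cnt in counts.items()}

def suspect_labels(oof_probs, labels, threshold=0.1):
    """Indices where the model gives the assigned label a probability below threshold."""
    return [
        i for i, (probs, label) in enumerate(zip(oof_probs, labels))
        if probs[label] < threshold
    ]

# Toy data: four samples, class indices 0 and 1 (made-up values).
labels = [0, 0, 0, 1]
oof_probs = [[0.90, 0.10], [0.05, 0.95], [0.80, 0.20], [0.30, 0.70]]

distribution = class_distribution(labels)
to_review = suspect_labels(oof_probs, labels)
```

A skewed `distribution` points to class imbalance, and the indices in `to_review` are candidates to inspect by eye, not confirmed mislabels.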
Instead, I focused heavily on building models and experimenting with architectures and hyperparameters.
Key Takeaway 🔑
Always, ALWAYS explore your dataset fully. In this competition, it made one of the biggest differences between the top solutions and mine.
PS: I have also recently started my own newsletter, a mix of the latest breakthroughs in AI, space and science. If you are interested in something like this, you can join at shubhamai.substack.com. It's completely free.
Originally published at https://www.notion.so.