I would pick one of two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model failures change as you test different model generations.