Hey everyone!
I’m building a similarity search for recommendations. I don’t have ground truth labels, just a handful of expected test cases that suggest I’m on the right track. Qualitatively, the results look clearly better than what we had before.
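For context, here’s roughly the kind of spot-check I run against those expected cases today. A minimal sketch: `search()` and the example data are stand-ins for my real pipeline, not the actual system.

```python
def search(query, k):
    # Placeholder for the real similarity search call.
    return ["trail runners", "hiking boots", "racing flats"][:k]

# Hand-written expectations: for each test query, items I'd hope to see.
EXPECTED = {
    "running shoes": {"trail runners", "racing flats"},
}

def hit_rate_at_k(expected, k=10):
    """Fraction of test queries with at least one expected item in the top-k."""
    hits = sum(1 for q, wanted in expected.items() if wanted & set(search(q, k)))
    return hits / len(expected)

print(hit_rate_at_k(EXPECTED, k=3))  # 1.0 for this toy case
```

This catches regressions, but with so few cases it’s more of a smoke test than an evaluation, which is why I’m asking.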
Traditional metrics like precision/recall require ground truth labels, so they don’t apply here. But I still want some evidence that the system is performing well.
Has anyone dealt with this? Looking for:
- Alternative evaluation approaches that don’t require ground truth
- Proxy metrics or empirical validation methods (one rough idea sketched after this list)
- Expert evaluation / human judgment approaches
- A/B testing strategies you’ve found useful
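On the proxy-metric point, here’s the kind of thing I’ve been toying with, so you can see the shape of what I’m after. A sketch under assumptions: it presumes items carry some weak metadata like a category, and the function name, shapes, and threshold-free score are all mine, not validated.

```python
import numpy as np

def category_consistency_at_k(embeddings, categories, k=10):
    """Mean fraction of each item's top-k neighbors sharing its category.

    No relevance labels needed: the intuition is that a reasonable
    embedding should mostly retrieve same-category neighbors.
    """
    # Cosine similarity via normalized dot products.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    topk = np.argsort(-sims, axis=1)[:, :k]
    same = categories[topk] == categories[:, None]
    return same.mean()

# Toy usage with random data, just to show the shape of the check.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
cats = rng.integers(0, 5, size=100)
print(category_consistency_at_k(emb, cats, k=5))
```

I’m not sure how much weight a check like this can bear (categories are a crude stand-in for relevance), so I’d love to hear what proxies have actually worked for people.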
Any insights appreciated!