Replies (6)

Troy 1 month ago
"I would like to buy a vowel."
hello 1 month ago
I'm no expert, but I think they are just noise. To avoid burnout and keep this maintainable, consider the following:

- Use a cheap LLM API, with your system instruction, as the main classifier. This saves you from hand-labeling piles of text.
- Use a distance-weighted kNN as your cache, built with scikit-learn on top of embeddings from a sentence-transformers model from Hugging Face. I wouldn't use something like TF-IDF (IMHO).
- When a new text comes in, run the kNN first. If it meets your threshold, skip the LLM API and use whatever the kNN cache says.
- If it doesn't meet your threshold, send the text to the LLM API, then append the prediction and the text's embedding to the kNN cache.

What is the point? Spam is a cat-and-mouse game. With this approach you only tune your system instruction or your threshold when needed, instead of constantly labeling data.
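Here is a rough sketch of the flow above, in case it helps. To keep it dependency-free, it uses a small pure-Python distance-weighted vote as a stand-in for scikit-learn's `KNeighborsClassifier(weights="distance")`. The names `KnnLlmCache`, `embed`, `classify_with_llm`, `min_confidence`, and `max_distance` are all made up for illustration, and the choice of cosine distance plus two separate thresholds is my assumption, not something fixed by the approach. In practice `embed` could wrap a sentence-transformers model's `encode` method, and `classify_with_llm` would call whatever LLM API you use.

```python
import math
from collections import defaultdict
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class KnnLlmCache:
    """Hypothetical kNN cache in front of an LLM classifier (illustrative only)."""

    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],   # assumed: text -> embedding
        classify_with_llm: Callable[[str], str],   # assumed: text -> label
        k: int = 5,
        min_confidence: float = 0.8,  # assumed threshold on the weighted vote share
        max_distance: float = 0.3,    # assumed cap on how far a neighbor may be
    ) -> None:
        self.embed = embed
        self.classify_with_llm = classify_with_llm
        self.k = k
        self.min_confidence = min_confidence
        self.max_distance = max_distance
        self.vectors: List[List[float]] = []
        self.labels: List[str] = []
        self.llm_calls = 0

    def _knn_vote(self, vec: Sequence[float]) -> Tuple[Optional[str], float]:
        """Distance-weighted vote over the k nearest cached items within max_distance."""
        scored = sorted(
            ((cosine_distance(vec, v), label) for v, label in zip(self.vectors, self.labels)),
            key=lambda pair: pair[0],
        )[: self.k]
        near = [(d, label) for d, label in scored if d <= self.max_distance]
        if not near:
            return None, 0.0
        weights = defaultdict(float)
        for d, label in near:
            weights[label] += 1.0 / (d + 1e-9)  # closer neighbors count more
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())

    def classify(self, text: str) -> Tuple[str, str]:
        """Return (label, source), where source is 'cache' or 'llm'."""
        vec = list(self.embed(text))
        label, confidence = self._knn_vote(vec)
        if label is not None and confidence >= self.min_confidence:
            return label, "cache"  # cache hit: no LLM call
        # Cache miss: ask the LLM, then remember its answer for next time.
        label = self.classify_with_llm(text)
        self.llm_calls += 1
        self.vectors.append(vec)
        self.labels.append(label)
        return label, "llm"
```

When the spammers change tactics, you edit the system instruction behind `classify_with_llm` or nudge `min_confidence` / `max_distance`. You would probably also want to clear or age out cached entries that were labeled under the old instruction, since the cache keeps repeating whatever the LLM said at the time.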