Only one word comes to mind when I think of who is running these bots, but I will self-censor. Let's just say the word contains an F, a couple of Gs and a T.
I am not an expert, but I think they are just noise. To avoid burnout and keep this maintainable, consider this:
- Use a cheap LLM API with your system instruction as the main classifier. This saves you from hand-labeling tons of text.
- Use a distance-weighted kNN as your cache via scikit-learn, with an embedding model loaded through sentence-transformers from Hugging Face. Don't use stuff like TF-IDF (IMHO).
- When new text comes in, run your kNN. If it hits your threshold, you don't call your LLM API; you just use whatever your kNN cache says.
- If it doesn't hit your threshold, send the text to your LLM API and append the prediction, together with the text embedding, to your kNN cache.
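
Here is a rough sketch of that loop, just to show the control flow. It is not a real implementation: everything in it is a stand-in I made up. `toy_embed` fakes an embedding model (in practice you would call a sentence-transformers model), `fake_llm` fakes the LLM API call, and the kNN vote is written out by hand in plain Python so it runs without dependencies (scikit-learn's `KNeighborsClassifier` with `weights="distance"` would take its place). I am also assuming "hits your threshold" means "the nearest cached neighbor is within some cosine distance", and the names `KnnCachedClassifier`, `k`, and `max_distance` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance in [0, 2]; zero vectors are treated as far from everything."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    # Clamp at 0 to absorb tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


class KnnCachedClassifier:
    """Distance-weighted kNN cache in front of an expensive classifier (hypothetical helper)."""

    def __init__(self, embed: Callable[[str], Vector],
                 llm_classify: Callable[[str], str],
                 k: int = 3, max_distance: float = 0.15) -> None:
        self.embed = embed
        self.llm_classify = llm_classify
        self.k = k
        self.max_distance = max_distance  # assumed meaning of "threshold"
        self.entries: List[Tuple[Vector, str]] = []  # (embedding, label) pairs
        self.llm_calls = 0

    def classify(self, text: str) -> str:
        vec = self.embed(text)
        # The k nearest cached entries, closest first.
        neighbors = sorted(
            ((cosine_distance(vec, v), label) for v, label in self.entries),
            key=lambda pair: pair[0],
        )[: self.k]
        if neighbors and neighbors[0][0] <= self.max_distance:
            # Cache hit: close neighbors vote, each weighted by 1 / distance.
            votes = defaultdict(float)
            for dist, label in neighbors:
                if dist <= self.max_distance:
                    votes[label] += 1.0 / (dist + 1e-9)
            return max(votes, key=votes.get)
        # Cache miss: ask the LLM, then remember its answer for next time.
        label = self.llm_classify(text)
        self.llm_calls += 1
        self.entries.append((vec, label))
        return label


# --- toy stand-ins so the sketch runs on its own ---
VOCAB = ["free", "crypto", "click", "thanks", "patch", "bug"]


def toy_embed(text: str) -> List[float]:
    """Keyword counts; a placeholder for a real sentence embedding."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def fake_llm(text: str) -> str:
    """Placeholder for the LLM API call with your system instruction."""
    return "spam" if "crypto" in text.lower() else "ham"


clf = KnnCachedClassifier(toy_embed, fake_llm)
first = clf.classify("free crypto click here")      # empty cache, goes to the LLM
second = clf.classify("click for free crypto now")  # close to the first, served from cache
third = clf.classify("thanks for the bug patch")    # nothing close, goes to the LLM
```

With a real embedding model the threshold needs tuning on your own data: too loose and the cache starts repeating old mistakes, too tight and almost everything still goes to the LLM.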
What is the point? Spam is a cat-and-mouse game. With this approach you only tune your system instruction or your threshold when you need to, instead of constantly labeling data.