My Q: what's the benefit of k-means clustering over percentiles?
There’s no hard rule that says one is “better” universally—they each have their own advantages. Here’s how to think about k-means versus a percentile-based approach:
1. Granularity vs. Simplicity
• Percentiles: You define discrete cutoffs (like quartiles or deciles) and group users accordingly. It’s simple, fast, and easy to explain: “light posters” are in the bottom 25%, “heavy posters” in the top 25%, etc.
• k-means: Clustering is data-driven. It finds natural groupings in your data without you having to pick arbitrary boundaries like 25% or 75%. It can capture nuances in how users are distributed (like an extreme outlier cluster of “super-posters”) more dynamically.
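To make the percentile approach concrete, here's a minimal sketch using only the Python standard library. The post counts and bucket names are made up for illustration:

```python
import statistics
from bisect import bisect_right

# Hypothetical post counts per user (values are made up for illustration).
posts = [1, 2, 3, 5, 8, 12, 20, 35, 60, 400]

# Quartile cut points: 3 values splitting the data into 4 buckets.
cutoffs = statistics.quantiles(posts, n=4)

labels = ["light", "medium", "active", "heavy"]

def bucket(count):
    """Map a post count to a quartile label via binary search over the cutoffs."""
    return labels[bisect_right(cutoffs, count)]

for count in posts:
    print(f"{count:>4} posts -> {bucket(count)}")
```

Note that the 60-post user and the 400-post user both land in "heavy" here, which is exactly the outlier-blindness discussed below.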
2. Automatic Handling of Outliers
• Percentiles: By definition, if someone is an extreme outlier, they’ll just fall into the top bucket—even if they’re posting way more than most others in that bucket. A single user in the 99.9th percentile looks exactly like someone at the 80th percentile if you just label them “top 25%.”
• k-means: If an outlier is truly far away from the rest, it might form its own cluster (depending on your chosen k). This can help differentiate the really extreme posters from the moderately heavy ones.
3. Flexibility in the Number of Groups
• Percentiles: Usually you pick common breakpoints like quartiles, deciles, etc. The groups might not reflect the actual natural distribution of your data.
• k-means: You choose how many clusters k you want, and it will algorithmically partition the data into that many “best-fit” groups based on variance minimization. You can experiment with different k values (e.g., 3, 5, or 8) to see which grouping captures the distribution best.
4. Interpretability
• Percentiles: Super easy to explain to users—“You’re in the top 25% of posters.”
• k-means: The meaning of each cluster can be less obvious to an end user (“You’re in cluster #2!” doesn’t mean much without explanation). But with some labeling strategy (“light,” “medium,” “heavy,” “extreme”), it can still be made user-friendly.
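One simple labeling strategy: rank the clusters by their centers, then map rank to a human-readable name. The cluster IDs and center values below are hypothetical:

```python
# Suppose k-means returned these cluster centers for posts-per-week
# (cluster ids and values are made up for illustration).
centers = {0: 42.0, 1: 3.5, 2: 11.0, 3: 310.0}

# Order clusters by center so labels follow activity level, not cluster id.
names = ["light", "medium", "heavy", "extreme"]
ranked = sorted(centers, key=centers.get)
label_for_cluster = {cid: names[rank] for rank, cid in enumerate(ranked)}

print(label_for_cluster)  # e.g. cluster 1 -> "light", cluster 3 -> "extreme"
```

This way “You’re in cluster #2” becomes “You’re a heavy poster,” regardless of the arbitrary cluster numbering k-means happened to produce.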
5. Implementation Complexity
• Percentiles: Requires a single sort of the data, then slicing by your chosen cutoff points. Very easy to implement.
• k-means: You need to run the k-means algorithm, which is still pretty straightforward (especially with existing libraries), but is more involved than a simple sort-and-slice.
Bottom Line
• If you want a quick, simple, predictable way to bucket people, go with percentiles.
• If you have lots of users, an unknown distribution, and want data-driven groupings (especially if outliers matter), k-means can do a better job of naturally reflecting how users cluster in real-world data.
Often, people start with percentiles (because it’s so easy) and later switch to k-means or a similar clustering approach for better nuance. It really depends on how “smooth” or “dynamic” you want your bucketing to be.
Replies (2)
I’ll probably start with fixed bucketing because that is simpler, but I like the idea of clustering for my visualizer. We shall see
This makes sense, I'm sure it's very fast either way