Thread - Nostr Hypermedia

I like the idea of local algos but k-means sounds like overkill for what seems like a rank ordering problem. It'd be cool though if some kind of semantic clustering of message content were possible.

1 replies ↓

jb55 _@jb55.com 1 year ago

I thought the same but chatgpt o1 said it might be worth trying if the distribution is unknown

1 replies ↓

jb55 _@jb55.com 1 year ago

My Q: what's the benefit of k-means clustering over percentiles ? There’s no hard rule that says one is “better” universally—they each have their own advantages. Here’s how to think about k-means versus a percentile-based approach: 1. Granularity vs. Simplicity • Percentiles: You define discrete cutoffs (like quartiles or deciles) and group users accordingly. It’s simple, fast, and easy to explain: “light posters” are in the bottom 25%, “heavy posters” in the top 25%, etc. • k-means: Clustering is data-driven. It finds natural groupings in your data without you having to pick arbitrary boundaries like 25% or 75%. It can capture nuances in how users are distributed (like an extreme outlier cluster of “super-posters”) more dynamically. 2. Automatic Handling of Outliers • Percentiles: By definition, if someone is an extreme outlier, they’ll just fall into the top bucket—even if they’re posting way more than most others in that bucket. A single user in the 99.9th percentile looks exactly like someone at the 80th percentile if you just label them “top 25%.” • k-means: If an outlier is truly far away from the rest, it might form its own cluster (depending on your chosen ). This can help differentiate the really extreme posters from the moderately heavy ones. 3. Flexibility in the Number of Groups • Percentiles: Usually you pick common breakpoints like quartiles, deciles, etc. The groups might not reflect the actual natural distribution of your data. • k-means: You choose how many clusters you want, and it will algorithmically partition the data into that many “best-fit” groups based on variance minimization. You can experiment with different values (e.g., 3, 5, or 8) to see which grouping captures the distribution best. 4. Interpretability • Percentiles: Super easy to explain to users—“You’re in the top 25% of posters.” • k-means: The meaning of each cluster can be less obvious to an end user (“You’re in cluster #2!” doesn’t mean much without explanation). But with some labeling strategy (“light,” “medium,” “heavy,” “extreme”), it can still be made user-friendly. 5. Implementation Complexity • Percentiles: Requires a single sort of the data, then slicing by your chosen cutoff points. Very easy to implement. • k-means: You need to run the k-means algorithm, which is still pretty straightforward (especially with existing libraries), but is more involved than a simple sort-and-slice. Bottom Line • If you want a quick, simple, predictable way to bucket people, go with percentiles. • If you have lots of users, an unknown distribution, and want data-driven groupings (especially if outliers matter), k-means can do a better job of naturally reflecting how users cluster in real-world data. Often, people start with percentiles (because it’s so easy) and later switch to k-means or a similar clustering approach for better nuance. It really depends on how “smooth” or “dynamic” you want your bucketing to be.

1 replies ↓

jb55 _@jb55.com 1 year ago

I’ll probably start with fixed bucketing because that is simpler, but i like the idea of clustering for my visualizer. We shall see

1 replies ↓

npub1xg5s...fx2g 1 year ago

This makes sense, I'm sure it's very fast either way

Constant Constant@techno-ethica.com 1 year ago

My thinking on this recently turned things around in a sense. These platform have a luxury Nostr has not, in there being a complete global state of all the things on a 'single' database. This means that before you can even filter, you need to explore. Now i guess the main underlying thesis is that with 'dead internet TM' we are forced to do this regardless (eventually), and that the platforms only provide an increasingly crumbling facade of a sensible world. Bias is not just the way we achieve finding the content we prefer, it is how we differentiate signal from noise, the real from the fake by distributing trust via the social graph, in the first place. My point is, the 'pick your own algo' meme is not some cool feature as a result of liberating ourselves from the platforms; it is the unfortunate necessity as a result of the impending wave of chaos that would otherwise engulf us; something platforms won't save us from, regardless how totalitarian they become in an attempt to keep their facade alive. Then again, I started out by saying ‘recently’, but in a sense I have just been spinning my wheels for over a year

Constant

This NOSTR stuff is not going to bring a 'better internet'; rather its embracing we passed peak-internet-goldenage already and things are bound to only get worse. Now to be fair and clear, that internet golden age was a short term fantasy where capital subsidized the server capacity believing they could effectively monetize the platforms eventually, on the one hand; and intrinsicly motivated sincere content creators able to do their thing before getting harrassed by the censorship and noise inducing wrestle for power/controll by political interests, on the other.

View quoted note →

3 replies ↓

Constant Constant@techno-ethica.com 1 year ago

A shorter way of saying it would be: Will: 'put the power back in the hands of the user' Translation:

Repeatedly nuked profile 1 year ago

There's ye olde argument that the economics of electricity and chips means duplication of labour vis a vis crawling and indexing must be aggressively minimised for a wider solution to achieve any sort of longterm viability. If you've got dozens of Nostr clients all individually crawling and indexing the same relays (as the basis for each client's 'pick-your-own-algo' feature-slash-unfortunate necessity) then it represents quite some potential heat-loss overall. Friendly sharing can help, but, outside of the right incentive structure, might be hard to extend beyond early days. An interesting takes on that challenge here:

SpringerLink

ESPRESSO: A Framework to Empower Search on the Decentralized Web - Data Science and Engineering

The increasing centralization of the Web raises serious concerns regarding privacy, security, and user autonomy. In response, there has been a rene...

1 replies ↓

Constant Constant@techno-ethica.com 1 year ago

I will have a look. The way i see things now is that each individual client (nor user using multiples of clients for that manner) wont have to perform such exercises over and over again each time. Running such an operation should result in a product (simply put a list of events), which can then be used by others. Also, these operations can vary in terms of debth and width, adjusting to usecase irt available compute and bandwidth. At @npub18zsu...8aap we call this type of operation 'pulse'; a ripple through the mess of events out there guided by a construct of biasses on npubs and lists. In any event, i gues my main argument would be that computational efficiency is irrelevent because due to spam, data curration (signal/noise diffirentiation) will be the #1 challenge and i'd argue to only way to tackle that is in in a distributed manner (i.e. relying on a network and networks of networks of people applying sensemaking for themselves). Any walled garden will either be too limited or run over by weeds with nothing in between.

1 replies ↓

Repeatedly nuked profile 1 year ago

To confidently offer what Primal offers every client would have to do what Primal is doing now, that is to say crawl and index everything. The same way Bing has to duplicate Google's (very expensive) work of crawling and indexing. There's just no getting around that at present, sans goodness-of-my-heart solutions. (Cooperative frameworks like this Espresso are well-envisioned but in early days, and address more the technical than the incentive-structure side of things.) This crawling and indexing also by nature applies to spam that has gotten past a given relay's own filter. This isn't so much about performing a computational event each time a user makes a query, as the bulk of crawling and indexing is done in anticipation of a query; minutes before, days before, years before. Rather this is about maintaining a foundation upon which algorithms of a certain type can be run. So we're talking about a related but somewhat different set of tasks here. Basically, in so far as this set goes, a 'Primal-like' user experience cannot scale on Nostr without either duplication of effort (every client sends out its own Googlebot or Bingbot) or a bandaid solution whereby the clients that are doing the costly work of crawling and indexing (the Primal or Primals) provide some kind of goodness-of-my-heart access to the fruits of their labour, and the clients that are making use of this access cross their fingers and hope things all carry on.

1 replies ↓

Constant Constant@techno-ethica.com 1 year ago

Ah, we are talking past eachother i see. Yes, primals approach is retarded and wont work. I was not catching on you were refering to that (that discussion might be all the rage right now, but was not part of the context here, so thats why). I was refering to/explaining something different entirely

1 replies ↓

Repeatedly nuked profile 1 year ago

All good, thanks for the chat! On that debate I think Primal is doing what makes sense for their users and business, and I'd go that direction too if I were Primal. Just that this development will nudge the Nostr ecosystem as a whole in another direction, as other clients can't be expected to have the budgets to do the same, nor can they bank on Primal's perpetual good graces. This new direction I think will be one where crawling and indexing (and a patched-together global view in general) is much less relevant. Like with Linux, how moving from the consumer operating system direction to the server direction suddenly made graphical interfaces (which were a weak point anyway) much less relevant.

Josua Schmid 1 year ago

Even both: Ask the librarian for a book AND walk through the library to pick one yourself.

Replies (13)