someone - Nostr Hypermedia

someone 9 months ago

Enjoy human alignment

someone 9 months ago

I am finding that Eastern models are scoring low on my leaderboard! I don't measure repression directly, but there appears to be a correlation. @gladstein My leaderboard is mostly about healthy living, nutrition, medicinal herbs (liberation from the sickcare industry), liberating technologies (nostr, bitcoin), and gardening and earthworks that liberate (permaculture). https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

someone 9 months ago

I have no time to track people's usage, but yes, a relay operator who wanted to know more could. I didn't even write scripts to delete events, so all the events are filling up memory and disks. Sometimes people ask for manual deletion of events and I don't reply, because I have no tools for that. I would have to log in and delete the notes from the command line.

My relays are pretty low maintenance. I rarely log in and do anything. Thanks to the really solid strfry implementation, they run without problems.

I limited the logs to about 4 GB, so old logs are deleted automatically and only recent logs are kept. I keep those logs in case of an attack on the relay from some IP. Maybe I could lower that number further, since the Hetzner DC seems to handle traffic floods really well. This setting also means the drives do not fill up quickly with logs, so even less maintenance for me.

nos.lol does not keep IP information in logs for a long time (only recent activity that fits within the 4 GB limit above). nostr.mom has a very old write policy script from the early days of Nostr. Back then, before strfry, I was using Cameri's relay software and needed to store IPs to be able to IP-ban attackers. Nowadays those attacks don't seem to happen.

There you go, some transparency for you.

someone 9 months ago

I guess fortune tellers losing their jobs to AI were "not in their cards"

tovima.com

ChatGPT Convinces Greek Woman to Divorce Husband

ChatGPT told the woman her spouse would cheat on her in the future with a woman named "E" leading to the wife filing for a divorce

someone 9 months ago

I think my leaderboard can be used for p(doom)!

Let's say the highest scores, around 50, correspond to p(doom) = 0.1.
And say the lowest scores, around 20, correspond to p(doom) = 0.5.

The last three models that I measured are Grok 3, Llama 4 Maverick and Qwen 3. Their scores are 42, 45, 41, so the average of the last 3 measurements is 42.67.

Mapping this to the scale above, between 20 and 50: (50 - 42.67) / (50 - 20) ≈ 0.24
Mapping this to the probability domain: (0.5 - 0.1) * 0.24 + 0.1 = 0.196

So the probability of doom is ~20%.

If models are released that score high on my leaderboard, p(doom) will decrease. If models are released that score low on my leaderboard, p(doom) will increase.
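The mapping in this post is a plain linear interpolation between two anchor points, so it can be written as a small function. The sketch below is only an illustration of that arithmetic: the function name `score_to_pdoom` and its parameter names are made up here, while the anchor values (score 50 maps to 0.1, score 20 maps to 0.5) and the three scores come from the post. Without rounding the intermediate 0.24, the result comes out at about 0.198 rather than 0.196, which is still roughly 20%.

```python
def score_to_pdoom(score, hi_score=50.0, lo_score=20.0, p_at_hi=0.1, p_at_lo=0.5):
    """Linearly map a leaderboard score to p(doom).

    Hypothetical helper: hi_score maps to p_at_hi and lo_score maps to p_at_lo,
    using the anchor points stated in the post.
    """
    # Fraction of the way from the best score down to the worst score.
    frac = (hi_score - score) / (hi_score - lo_score)
    # Interpolate between the two probabilities.
    return p_at_hi + (p_at_lo - p_at_hi) * frac


# Scores quoted in the post for the last three measured models.
scores = {"Grok 3": 42, "Llama 4 Maverick": 45, "Qwen 3": 41}

avg = sum(scores.values()) / len(scores)
p = score_to_pdoom(avg)

print(f"average score: {avg:.2f}")  # average score: 42.67
print(f"p(doom): {p:.3f}")          # p(doom): 0.198
```

A higher average score moves the result toward 0.1 and a lower one toward 0.5, which matches the last two sentences of the post.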

someone 9 months ago

Qwen 3 numbers are in! They did a good job this time: the numbers are a lot better than those of 2.5 and QwQ. I used 2 GGUFs for this, one from LMStudio and one from Unsloth. Number of parameters: 235B A22B. The first one is Q4, the second one is Q8. The LLMs that did the comparison are the same as before, Llama 3.1 70B and Gemma 3 27B. So I took 2*2 = 4 measurements for each column and averaged them. My leaderboard seems pretty unrelated to the others. It is valuable in that sense: another non-mainstream angle for model evaluation. More info:

AHA Leaderboard

A Blog post by Emin Temiz on Hugging Face
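The "2*2 = 4 measurements per column" step above can be sketched as a small grid average: two quantizations, each scored by two judge models, then averaged. The numbers below are deliberately artificial placeholders (10, 20, 30, 40), not real leaderboard results, and the variable names are assumptions made for illustration only.

```python
from statistics import mean

# One leaderboard column, measured once per (quantization, judge model) pair.
# PLACEHOLDER values, not real measurements from the leaderboard.
measurements = {
    ("Q4", "Llama 3.1 70B"): 10.0,
    ("Q4", "Gemma 3 27B"): 20.0,
    ("Q8", "Llama 3.1 70B"): 30.0,
    ("Q8", "Gemma 3 27B"): 40.0,
}

# The column score is the plain average of the four measurements.
column_score = mean(measurements.values())
print(column_score)
```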

someone 9 months ago

median p(doom) is around 30%. i am saying there could be beneficial AGI, which is something that goes against harmful AGI. so my p(doom) would be lower than this.

PauseAI

List of p(doom) values

How likely do various AI researchers believe AI will cause human extinction?

someone 9 months ago

gemma 3 fine tuning was not as effective as llama 3. it responded well to my healthy living type of datasets and learned well. but in faith, fasting and misinformation type of domains, it got stuck and doesn't want to learn more. i guess LLMs can be stubborn too!

someone 9 months ago

Qwen cooked again.

Qwen

Qwen3: Think Deeper, Act Faster

Today, we are excited to announce the release of Qwen3, the latest additi...

China seems to be producing smarter models. But are they human aligned? Will check soon.

someone 9 months ago

there is no nostr web client that works perfectly

someone 9 months ago

Llama 4 Maverick got worse scores than Llama 3.1 405B in human alignment. I ran inference on CPU for a model of this size (402B), and it ran fast. Being a mixture of experts, it may be useful for CPU inference, and its big context may be useful for RAG. For beneficial answers there are other alternatives. Still, it managed to beat Grok 3. I had such high expectations for Grok 3 because X holds more beneficial ideas, in my opinion. Maverick got worse health scores than 3.1 and better bitcoin scores. I could post some comparisons of answers between the two. Which model should I compare it with when I publish: Llama 3.1, Grok 3, or something else?

Zoho Sheet

someone 10 months ago

Grok 3 Human Alignment Score: 42. It is better in health, nutrition, and fasting compared to Grok 2. About the same in liberating tech like bitcoin and nostr. Worse in the misinformation and faith domains. The rest is about the same. So we have a model that is less faithful but knows how to live a healthier life.

Benchmarking Human Alignment of Grok 3

A Blog post by Emin Temiz on Hugging Face