Rusty Russell
rusty@rusty.ozlabs.org
npub179e9...lz4s
Lead Core Lightning, Standards Wrangler, Bitcoin Script Restoration ponderer, coder. Employed full-time on Free and Open Source Software since 1998. Joyous hacking with others for over 25 years.
The only thing dumber than talking about the Bitcoin price is making Bitcoin price predictions.
I mainly end up hiring workaholics. This is a consequence of seeking passionate, smart people who love their work. So as a manager I mainly find myself telling them to take more leave and asking pointed questions if I receive an email from them far outside hours in their TZ. But it also means I model the behavior I want, which helps me regulate my own hours. I have youngish kids, and my wife has her own career, so I try to stick to my weekly work hours. And I broadcast that to my team. I want to work with these people for a decade, so it's a marathon not a sprint.
xpay (not *coat*, thanks autocorrect!) bug reports trickle in. I'll try for a .1 release this week with fixes. I am impressed by the number of people banging on it: some of the things I knew were sub-optimal (esp if you tell it to override the pay command) now seem more important. Away early January, and Blockstream gave us all the Xmas week off, so this week is critical. Like, y'know, every other week!
So, a lovely interaction with Jeremy Rubin where he shattered my XOR simplified CTV scheme. Damn. So I'm banging my head against the problem some more. I want "txid with this input txid zeroed" but that can involve too much hashing in the worst case. Even if you move the txids to the end: about 250 GB according to my rough calc. Jeremy suggested a merkle tree, which can work, but we're getting uncomfortably far from "simple" now. Specifically, my bar is "how hard would it be to produce this *in Script*, assuming that's fully re-enabled?". Not too bad with a known number of inputs, but I don't even want to think about dealing with arbitrary numbers. Varops budget doesn't really help here, either. Everywhere else, you can't hit the varops limit unless *your input script* is doing wild things: this would mean you can hit the limit with a single opcode in a reasonable script :( You're better off just saying "your tx which uses this opcode must have no more than 64 inputs" or "no larger than 10k", but that feels totally arbitrary. For those following along at home: CTV solves this by committing to just the number of inputs, and if that's not 1 you're kind of on your own. It's not *banned*, just shrugged. I dislike this hole, but do I dislike complexity more? This is what I ponder over morning coffee before Real Work.
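(For concreteness, here's a toy Python sketch of where the cost comes from. It's nothing like real consensus code, and it only covers the input-txid portion of the hash, but it shows that the naive "recompute with this input's txid zeroed" approach re-hashes the whole input list once per input, i.e. the data hashed grows quadratically with the number of inputs.)

```python
# Toy illustration only, NOT consensus code: each input gets its own hash of
# the full txid list with its own entry blanked out.  n inputs each hash
# n * 32 bytes, so total hashing is O(n^2) in the input count.
import hashlib
from typing import List

def zeroed_input_commitments(input_txids: List[bytes]) -> List[bytes]:
    """For each input, hash the txid list with that input's txid zeroed."""
    out = []
    for i in range(len(input_txids)):
        preimage = b"".join(
            bytes(32) if j == i else txid   # blank out our own txid
            for j, txid in enumerate(input_txids)
        )
        out.append(hashlib.sha256(preimage).digest())
    return out
```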
BTW, Rearden (apparently from Jeremy?) pointed out that my simplified CTV-like scheme was flawed because it didn't commit to the order of input txids. You need to xor SHA(inputnum | intxid) for each input to fix this. I still like the scheme, because it clearly commits to everything the txid commits to (with modifications required by efficiency concerns). Like a "forward txid" to mirror the normal txids which are backwards references. I should write it up, for comparison with CTV. Maybe once I've done that I'll no longer think it's a significant simplification?
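(A toy Python sketch of that fix, for the curious. The 4-byte little-endian index encoding is purely my assumption; any fixed serialisation would do. And as the post above notes, Jeremy later shattered the XOR approach, so treat this as an illustration of the idea, not a secure construction.)

```python
# Sketch of the order-committing XOR trick: XOR together
# SHA256(input_index || input_txid) for every input.  Prefixing the index is
# what fixes the "doesn't commit to input order" flaw.
import hashlib
from typing import List

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def input_commitment(input_txids: List[bytes]) -> bytes:
    """XOR of SHA256(4-byte LE index || 32-byte txid), one term per input."""
    acc = bytes(32)
    for i, txid in enumerate(input_txids):
        term = hashlib.sha256(i.to_bytes(4, "little") + txid).digest()
        acc = xor_bytes(acc, term)
    return acc
```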
I'm slowly coming around to the following roadmap: 1. Simplified CTV for whole-tx commitments (ie you *will* spend this output using a tx exactly like X). 2. Optimised sponsors for solving the "but how do I add fees" problem in a way that doesn't drive miner centralisation. 3. Script restoration so we don't have arbitrary limits on things like amount arithmetic and examination sizes. 4. Introspection opcode(s) so we can examine txs flexibly. 5. Script enhancements for things like merkle proofs (e.g. Taproot trees), tweaks, and checksig. You could argue that #1 is simply an optimisation of #3/#4, and that's true, but it's also an obvious case (once you have #2) that we will still want even when we have all the rest.
Some nice xpay bug reports coming in, from real usage. A nice "please submit a bug report" message came in (obviously, a case I thought would never get hit!). So my weekend (remember how I said I wouldn't be working all hours now the release is close? Ha!) has been occupied thinking about this. On the surface, this happens when we exactly fill the maximum capacity of a local channel, then go to add fees and can't fit (if we hit htlc max, we split into two htlcs for this case). We should go back and ask our min-cost-flow solver for another route for the part we can't afford. This is almost certain to fail, though, because there was a reason we were trying to jam the entire thing down that one channel. But what's more interesting is what's actually happening: something I managed to accidentally trigger in CI for *another* test. See, we fail a payment at the time we get the peer's sig on the tx with the HTLC removed. But after that, there's another round trip while we clear the HTLC from the peer's tx. The funds in flight aren't *really* available again until that completes. This matters for xpay, which tends to respond to failure by throwing another payment out. This can fail because the previous one hasn't totally finished (in my test, it wasn't out of capacity but actually hit the total dust limit; it's the same effect: gratuitous failure on the local channel). Xpay assumes the previous failure is caused by capacity limits, and reduces the capacity estimate of the local channel (it should know the capacity, but other operations or the peer could change it, so it tries not to assume). Eventually, this capacity estimate becomes exactly the payment we are trying to make, and we hit the "can't add fees" corner case. There are four ways to fix this: 1. Allow adding a new htlc while the old one is being removed. This seems spec-legal but in practice would need a lot of interop testing. 2. Don't fail htlcs until they're completely cleared. But the sooner we report failure the sooner we can start calculating more routes. 3. If a local error happens, wait until htlcs are fully clear and try again. 4. Wait inside "injectpaymentonion" until htlcs are clear. We're at rc2, so I'm going mid-brain on this: wait for a second and retry if this happens! Polling on channel htlcs is possible, but won't win much for this corner case. Longer term, inject could efficiently retry (it can trigger on the htlc vanishing, as it's inside lightningd). But that's more code and nobody will ever care.
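(Here's a minimal sketch of that "wait a second and retry" shape, just to make it concrete. The inject_payment callable and StillClearingError exception are hypothetical stand-ins, not the actual xpay/lightningd interfaces.)

```python
# Minimal sketch of the mid-brain fix: if injection fails because the
# previous HTLC hasn't fully cleared the local channel, sleep briefly and
# retry instead of immediately shrinking the capacity estimate.
import time

class StillClearingError(Exception):
    """Hypothetical: previous HTLC not yet removed from the peer's tx."""

MAX_RETRIES = 3

def inject_with_retry(inject_payment, onion, delay=1.0):
    for _ in range(MAX_RETRIES):
        try:
            return inject_payment(onion)
        except StillClearingError:
            # Funds aren't really available again until the peer's
            # commitment no longer contains the failed HTLC, so just wait.
            time.sleep(delay)
    # Give up waiting; let the normal failure handling (capacity
    # re-estimation, alternative routes) take over.
    return inject_payment(onion)
```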
It is important to empathize with frustrated users. It's sometimes an unattainable ideal, but who hasn't hit software that Just Doesn't Work? We don't really care if it's just something about our setup, or fundamentally broken, or a completely unhelpful error message: it's an incredibly frustrating feeling of impotence. Sure, you shouldn't take it out on the devs you aren't paying, but we're all human. I can't speak for all developers, but I became a FOSS coder in the Linux Kernel. That gave me a pretty thick skin: Linus could be an ass, and even when he was wrong there was no appeal. So I generally find it easier to sift through the users' frustrations and try to get to the problem they are having. And often, it turns out, I agree! This shit should just Work Better! CLN payments are the example here, and they were never my priority. That might seem weird, but the first production CLN node was the Blockstream store. So we're good at *receiving* payments! But the method of routing and actually making payments is neither spec-defined nor a way to lose money. It's also hard to measure success properly, since it depends on the vagaries of the network at the time. But it's important, it turns out :). And now we see it first-hand since we host nodes at Greenlight. So this release, unlike most, was "get a new pay system in place" (hence we will miss our release date, for the first time since we switched to date-based releases). Here's a list of what we did: 1. I was Release Captain. I was next in the rotation anyway, but since this was going to be a weird release I wanted to take responsibility. 2. I wrote a compressor for the current topology snapshot. This lets us check a "known" realistic data set into the repo for CI. 3. I wrote a fake channel daemon, which uses the decompressed topology to simulate the entire network. 4. I pulled the min-cost-flow solver out of renepay into its own general plugin, "askrene". This lets anyone access it, lets @lagrange further enhance it, and makes it easier for custom pay plugins to exist: Michael of Boltz showed how important this is with mpay. 5. A new interface for sending HTLCs, which mirrors the path of payments coming from other nodes. In particular, this handles self-pay (including payments where part is self-pay and part remote!) and blinded path entry natively, just like any other payment. 6. Enhancements and cleanups to our "libplugin" library for built-in plugins, to avoid nasty hacks pay has to do. 7. Finally, a new "xpay" command and plug-in. After all the other work, this was fairly simple. In particular, I chose not to be bound to the current pay API, which is a bit painful in the short term. 8. @Alex changed our gossip code to be more aggressive: you can't route if you can't see the network well! Importantly, I haven't closed this issue: we need to see how this works in the Real World! Engineers always love rewriting, but it can actually make things worse as lessons are lost, and workarounds people were using before stop being effective. But after this fairly Herculean effort, I'm going to need to switch to other things for a while. There are always other things to work on!
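(If you want to poke xpay from Python, something like this should work via pyln-client's generic call(). I'm assuming the invoice parameter is named "invstring"; check `lightning-cli help xpay` on your build.)

```python
# Quick sketch: drive the new xpay command from Python via pyln-client's
# generic call(), so nothing depends on generated method names.
from pyln.client import LightningRpc

# Path to your node's RPC socket; adjust as needed.
rpc = LightningRpc("/path/to/lightning-rpc")

# Pay a BOLT11/BOLT12 invoice with the new xpay plugin ("lnbc1..." is a
# placeholder; "invstring" is my assumed parameter name).
result = rpc.call("xpay", {"invstring": "lnbc1..."})
print(result)
```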
Writing release notes is fun, but the part I really like in the release process is preparing the first commits for the *next* release: 1. BOLT spec updates. We check all the BOLT quotes in our source, and have a script to update the spec version one commit at a time. This is a grab bag of typo fixes, feature merges (which may mean we no longer need our local patches), and occasionally major changes. It's unpredictable enough that I enjoy it. 2. Removing long-deprecated features. We now give a year, then you can enable each deprecated feature individually with a configuration flag, then (if we haven't heard complaints!) we finally remove it. This means removing code (usually ugly shim code) and is a genuine joy. I've started this for 25.02, and it's a balm after the release grind...
The problem with CTV is fees. When you look at most designs using CTV, they need *another* tx, and an anchor output, so they can pay fees. What they really want is "this tx, plus an input and optional change output". People tend to ignore fees in their protocol design. But in implementation they're critical, and only getting more so. Lightning has been down this path!
Trying to do everything, but mainly doing a little of everything badly. #CLN release is late, I've been ignoring the BOLTs changes, I haven't even looked at GSR since my OP_NEXT talk, and my comprehensive list of opcodes for introspection is still stuck in my head. I try to take time to post on #nostr over morning coffee though, since doing that every day helps move things forward.