Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays.

db0@lemmy.dbzer0.com · 6 months ago

Dessalines@lemmy.ml · edit-2 6 months ago

Glad you were able to figure this one out. I never know whether to be mad at myself or proud of my persistence when I spend like a day trying to fix something that turned out to be really simple and almost always unrelated to what I thought the problem was 😂

Edit: also if you found any performance-related config improvements, either to the postgres.conf, nginx.conf, or lemmy.hjson, please contribute them to lemmy-ansible so that all instances can benefit from what you've learned.

db0@lemmy.dbzer0.com · 6 months ago

Already sent a big PR for lemmy-doc 😊

nutomic@lemmy.ml · edit-2 6 months ago

> As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.
>
> This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to a few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.

I disagree with this conclusion. If you had installed Lemmy according to the official instructions, you would have the database, backend and everything else on the same server and would never have run into this particular issue. And any problems you'd have would likely be noticed (and debugged) by many other instances too. Your setup is heavily customized, so it is only natural that there are few people who can help with it.

Anyway, it's an interesting journey. Thanks for writing down your experience and for improving the documentation!

db0@lemmy.dbzer0.com · edit-2 6 months ago

The official instructions do not scale, nor do they work for all situations. But besides that, the problem is not that my bad setup caused a problem. Shit happens and I didn't blame anyone but myself. The problem is that when something breaks, one has to get lucky to get support. I don't even have to prove this. I know for a fact that there are lemmy instances that were decommissioned because they followed the default setup, ran into issues, got no support and gave up.

Edit: Also, man, from one FOSS developer to another: You really have to learn to stop the instinct to say 'it broke because you did it wrong'. I know it feels unfair, but trust me, this is not the way.

nutomic@lemmy.ml · 6 months ago

I'm not saying you did it wrong. It's open source, so of course you can use it in any way you like. But some ways have a higher risk of breaking than others.

Simon@lemmy.dbzer0.com · 6 months ago

This is my job, so I'll counter that this isn't realistic. In a professional situation it would probably be hosted in Kubernetes, which spans multiple servers and sometimes multiple regions - I don't think the devs have a readme for that (or maybe they do). The point being that the official docs are geared for a hobbyist setting up a node, and not having separate VMs makes sense in that scenario. However, I would say it's plain that mister db0 has a much larger instance than could be considered hobbyist at this point.

henfredemars@infosec.pub · 6 months ago

> This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to a few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.
>
> This also gave me an insight about how the federation of lemmy will eventually break when a single server (say, lemmy.world) grows big enough to start overwhelming even servers that are not badly set up like mine was.

Lemmy has many scalability problems to solve, and not all of these problems are slow database queries. I believe your experience is going to become increasingly common as the community grows because that increased centralization will compound the scalability problems and continue to drive up the technical know-how required to host a successful instance. The software eventually needs to do more to detect and present operational problems to administrators in a friendly way. I2P is an example of a distributed network that's quite good at reporting issues with the node.

With that said, not everything is doom and gloom. The community has proven itself highly resilient and smart people like yourself are finding solutions. It's going to be a tough road ahead.

Blaze@dormi.zone · 6 months ago

Thanks for the write up!

WeirdGoesPro@lemmy.dbzer0.com · 6 months ago

Thank you for your hard work.

Unruffled@lemmy.dbzer0.com · 6 months ago

Well that was an entertaining read! Thanks for all your efforts to keep our instance running smoothly. I have noticed it seems a bit snappier since you fixed the problem.

Jo Miran@lemmy.ml · 6 months ago

That was like reading Homer's "The Iliad".

count0@lemmy.dbzer0.com · 6 months ago

Great writeup, thank you so much for sharing!

Nothing more frustrating than googling an issue and (only) finding forum threads ending in "nvm it works now" 😬