I am honestly kinda curious how exactly you manage your servers and keep them up to date, I feel like there is a linux patch every week or so, which would usually require a reboot. Do you all deploy live patching, how you become aware of critical stuff in your otherwise busy lives, RSS?

  • intensely_human@lemm.ee
    link
    fedilink
    English
    arrow-up
    1
    ·
    1 year ago

    I should preface by saying I’m reporting what I’ve heard from other engineers. I myself have not worked on high-uptime systems

    The only high-uptime system I built was very simple and that simplicity was the strategy. Basically it was built once, never changed, then was up continuously from 2017 to 2022 before it was retired.

    I’m not sure how I’d handle a schema change. I’m tempted to say I could have a new db with the migrations applied, then I’d replay the log of all transactions from my production database to the new database, applying a pure function to alter it as necessary to match the new structure.

    In this way I’d have a lagging real-time copy of the database ready to swap in.

    Then it would be a matter of applying enough resources to get that gap tiny, but any moment of switchover is going to have some unprocessed transactions.

    That log of unprocessed transactions, assuming I can’t eliminate it, will need to be played into the new databas before at least some subset of its related data will be valid, meaning a time gap between switchover and availability.

    I could maybe get the time gap scoped to only a few records, so that most records would work flawlessly in for users but these particular records would have their own “locked for maintenance” screens.

    Or I could just silently update the UI for any data that changes as a result of being generated in the post-migration schema, and make it clear to my users that my system isn’t guaranteed up to date except right after a page refresh. Ie sometimes changes in data can take a few seconds to be reflected at the edges.

    This is all me speculating though, I haven’t done it myself.

    I think I could probably produce a system that would work well, with the user understanding that “sometimes there’s stale data and you gotta refresh”, to have no down time and a few data mismatches resolvable down to a little UI update delay based on polling or push notifications.

    But I don’t know if I could produce a system that would go through a schema update, with no downtime, while also never displaying incorrect data.

    Maybe in a high-reliability scenario the best thing would be to put up a banner saying “system migration in progress; data error window is open until this message is gone” and have the page auto refresh.

    There could be something much simpler I’m missing though. I’d love to hear from some devops or sysadmins who can speak about it from experience.