Hacker News

Does the duration of their downtime suggest a "1/1000" unmonitored oversight? Or is it more like a threshold that was met and probably could/should have been observed?

And FWIW, they have downtime every day and weekend, at least in a virtual sense; the load does drop off in a very real sense too. You are spiritually correct: they should pull together and sort it out, and they owe nobody money here (don't use a discount broker if you want some sort of guarantee about trades). But as a general rule you should never feel too sorry for bankers under just about any circumstances. The harshest lesson here, for everybody, is that the only thing they would do for you was give you some commission-free trades, and that won't work this time, so a non-apology is what you get.



I think you may be focusing on the finger instead of the thing that it's pointing at.

The post reads to me like all those examples were meant to be concrete examples to drive home a more general argument that complex systems are, well, complex, and that there's an element of hubris in taking potshots from the peanut gallery.


I think the original point in this sub-thread boils down to: basic micro-level human error like typos and bad configuration deploys is completely understandable (to a certain extent), but macro-level failures that happen by ignoring obvious trends and best practices are malfeasance.

Personally I don't think Robinhood will ever release a full, honest post-mortem, and so we'll never know (and never be able to judge fairly).

If the system failed by virtue of being too complex, that is also malfeasance, because any devops/SRE worth their salt (as might be expected at a 7 BILLION DOLLAR company) should smell unnecessary complexity from a mile away and slowly refactor it away over the course of several years - which, looking at Robinhood's downtime history, they never did.

The closest example to Robinhood's engineering woes is Reddit, which throughout its early history made fairly poor infrastructure and data-modeling decisions but has since repaired and improved on them. We should hold Robinhood to higher expectations than Reddit for obvious reasons. Them having engineering capability similar to circa-2012 start-up Reddit is INEXCUSABLE.


As with any big system, spinning it up is much harder than bringing it down. After an outage, they have to stay offline to audit their systems to ensure that all the nodes are synchronized, all queued trades have been processed, and no accounts are in invalid states. I'm sure they could have restarted in a matter of minutes, but the risk is ridiculously high.
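The pre-restart checks described above can be sketched as a simple gating function. This is a hypothetical illustration, not Robinhood's actual architecture: the `Node` fields and the `safe_to_restart` helper are invented here to show the idea of refusing to come back online until every node is synchronized and no queued trades remain.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One service node in a hypothetical trading backend."""
    name: str
    ledger_version: int   # last ledger entry this node has applied
    queued_trades: int    # trades accepted but not yet processed

def safe_to_restart(nodes: list[Node]) -> list[str]:
    """Return blocking issues; an empty list means the fleet may come back up.

    Mirrors the audit steps above: all nodes on the same ledger version,
    and no trades still sitting in a queue.
    """
    issues = []
    versions = {n.ledger_version for n in nodes}
    if len(versions) > 1:
        issues.append(f"nodes out of sync: ledger versions {sorted(versions)}")
    for n in nodes:
        if n.queued_trades:
            issues.append(f"{n.name}: {n.queued_trades} unprocessed trades")
    return issues

# One stale node ("c") is enough to block the restart.
fleet = [Node("a", 1041, 0), Node("b", 1041, 0), Node("c", 1040, 3)]
print(safe_to_restart(fleet))
```

The point of gating on the full issue list rather than a boolean is that operators can see every blocker at once instead of fixing them one restart attempt at a time.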



