On the professional side of this: if you're an engineer at RH in the thick of this, many have been there. It seems dire now, but in a few years the fog, panic, and haze of no sleep will become a story you tell your peers at happy hour.
Many will cast stones - but they have been there too. If they haven't, well, maybe their day will also come. You may feel bad at the moment - but the best way forward professionally is "We try our best tomorrow."
If this were an outage directly caused by a natural disaster, I could understand. This outage was an availability problem. This clearly points to some prioritization problems within the leadership layers if robust and resilient infrastructure was not emphasized.
The prioritization problems may not be due to ignorance or malice though, and may be justifiable if there are other fires that are burning brighter. It's still pointing to problems though, and I think it's completely legitimate for engineers to question the stability of the company when this sort of thing happens.
At the very least, as an engineer I would be asking some pointed questions of my leadership. Maybe not dusting off the resume yet, but I'd still want reassurance internally that the leadership problems that caused this are being addressed.
Sometimes you just have to cut them some slack. Have you engineered a highly available cluster before? I'm not talking about the hot-standby postgres master that gets called on once every 2 years, but I'm talking about a 180 node Cassandra cluster that's doing 15,000 writes a second 24/7 and peaking at 60,000 writes a second every day, and you have to do node replacements every week or two because of the high load.
Or I'm talking about a 200 node Hadoop cluster that's doing the electrical metering and billing for 8 million people, and is NOT allowed to stop.
Or the trading platform that's running sub-millisecond trades, where downtime means 300,000 USD per minute.
These are systems I have engineered over the last 10 years, and I can say: These things are complex and have failures in 1000 different ways, and while you're monitoring 999 of them that one thing you're not looking at is festering under the surface (your monitoring system is tracking IRQ hardware interrupt response times, right???)
Part of being in a team is everyone pulling together, and yes it's stressful at the time, but even very good management can't see all ends, just like very good engineering can't predict everything. I don't think it's useful to start pointing the finger at management and "asking some pointed questions at leadership" because sometimes everyone is doing their best. Yes we should analyse our failures so we can do better, but your tone is very accusatory, and I believe that a better approach is an all inclusive chat about how we can do better, and management saying "great job engineering" for fixing it, and giving them a break after the stressful event.
Does the duration of their downtime suggest a “1/1000” unmonitored oversight? Or is it more like a threshold that was met and probably could/should have been observed?
And FWIW, they have downtime every day and weekend, at least in a virtual sense; the load does drop off in a very real sense too. You are spiritually correct: they should pull together and sort it out, and they owe nobody money here (don’t use a discount broker if you want some sort of guarantee about trades), but as a general rule you should never feel too sorry for a banker under just about any circumstances. The harshest lesson here, for everybody, is that the only thing they would normally do for you is give you some commission-free trades, but that won’t work with this one, so a non-apology is what you get.
I think you may be focusing on the finger instead of the thing that it's pointing at.
The post reads to me like all those examples were meant to be concrete examples to drive home a more general argument that complex systems are, well, complex, and that there's an element of hubris in taking potshots from the peanut gallery.
I think the original point in this sub-thread boils down to: basic micro-level human error like typos + bad configuration deploys is completely understandable (to a certain extent), but macro-level failures that happen by ignoring obvious trends and best practices are malfeasance.
Personally I don't think Robinhood will ever release a full honest post-mortem and so we'll never know (and never be able to judge fairly).
If the system failed by virtue of being too complex, that is also malfeasance because any devops/SRE worth their salt (as might be expected at a 7 BILLION DOLLAR company) should smell unnecessary complexity from a mile away and slowly refactor it away over the course of several years - which, looking at Robinhood's downtime history, they never did.
The closest example to Robinhood's engineering woes is Reddit, which throughout its early history made fairly poor infrastructure and data modeling decisions but has since repaired and improved on them. We should hold Robinhood to higher expectations than Reddit for obvious reasons. Them having similar engineering capability to circa-2012 start-up Reddit is INEXCUSABLE.
As with any big system, spinning it up is much harder than bringing it down. After an outage, they have to stay offline to audit their systems to ensure that all the nodes are synchronized, all queued trades have been processed, and no accounts are in invalid states. I'm sure they could have restarted in a matter of minutes, but the risk is ridiculously high.
No doubt there are many complex systems and they inevitably go down. Every provider has suffered meaningful outages.
I think the issue here isn’t so much that the system went down but the blog post.
It’s very light on details and doesn’t go far enough in terms of re-establishing trust with the customers that were affected. Which by the looks of it is everyone attempting any trade most of the day on Monday.
On the other hand, they've had plenty of time and resources to do just that in a reliable fashion, it's not like it's one guy in his bedroom (I hope!). It's not like they are volunteers doing this open source for the community, they are getting paid (very well, I assume) to run the system. And Management is getting paid (even better, I assume) to make sure the priorities are right and correct decisions are taken. "Who could've known there might be a lot more traffic" sounds like somebody failed in Management, and engineering might have failed by not foreseeing the issue and/or informing Management.
Sure, don't burn people at the stake, but "hey, it's hard, don't blame them, they are doing their best" doesn't cut it for me. I'm sure they're expecting to be paid and not for someone to "do their best" to pay them.
Can you give me a concrete example of a massive distributed system that has zero downtime?
Because the largest distributed system I have seen and worked on was at Apple (or maybe DFP at Google) - and even though they had some of the smartest people in the world and literally billions of dollars behind them, there was still an endless list of problems and downtime events.
The point isn't that "a system cannot fail", the point is "if the system fails, it's no big deal, shit happens, cut them some slack" is a weird way to look at it for corporate systems, especially in sensitive areas.
If you're running a HA system and you only need one nine to express your availability percentage, sure, sure, you have the smartest people etc and you're doing such a great job, and yeah, yeah, show me one system that has 100% uptime etc.
I didn't say it's no big deal; you're extrapolating and exaggerating my words because your argument is weak.
My point was that failure is inevitable in any complex system. I was responding to the parent, who immediately pointed the finger at management in an accusatory way, and I was saying that's not constructive.
Also, your point "They expect to be paid" is actually implicitly "I expect management to do their best to pay me" - there could be a failure in the payroll system, there could be a failure in the banks, there could be many reasons outside management's control that mean I'm not getting paid. I can say "why don't you have redundant payroll systems" (which is a stupid waste of resources given the cost/benefit/low failure rate). But my point is again - complex systems have failures - and SOMETIMES, JUST SOMETIMES, YOU CAN CUT THEM SOME SLACK.
When a fiduciary breaks their duty to their clients, you don’t cut them slack. You sue them. This isn’t like Silicon Valley where you can get away with antics like this.
You must be new here, welcome to late stage capitalism. Nobody rich goes to jail, and lawsuits are cost of business. You just factor them into the 5 billion dollar company, pay your 300M dollar fine and walk away a billionaire.
Google doesn’t target zero downtime. The marginal cost is too high. For important services (like the Search page and ads) they aim for 5 nines uptime (99.999%), which translates to roughly 5 minutes of downtime per year.
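For anyone who wants to check the arithmetic behind "N nines", here is a minimal sketch. The function name `downtime_minutes` and the 365-day year are my own assumptions for illustration, not anything Google publishes.

```python
# Downtime budget implied by an availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes(availability: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed by an availability fraction."""
    return (1.0 - availability) * period_minutes


targets = [
    ("two nines", 0.99),
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]

for label, target in targets:
    print(f"{label}: {downtime_minutes(target):.2f} min/year")
```

Five nines works out to a little over 5 minutes a year, and each nine you drop multiplies the budget by ten.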
Then certainly you understand the importance of SLOs, how SLAs regulate reliability and feature velocity.
Let’s say I’m RobinHood. Let’s pick an SLO. I think a three nines monthly SLO is a good start; that budgets ~45 minutes of downtime per month. Maybe I can argue for a more aggressive SLO, but let’s pick this one - because I think it will keep users relatively happy as trades aren’t blocked for more than an hour at worst. I drive an agreement with stakeholders that if we fall out of this SLO, we drop all feature work and focus on hardening reliability.
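A rough sketch of that policy as code, under my own assumptions: a 30-day month (which gives 43.2 minutes, close to the ~45 quoted) and hypothetical function names. The outage durations at the bottom are made-up inputs, not measured figures.

```python
# Error-budget check for a monthly availability SLO.
# Assumptions: 30-day month; names and example inputs are illustrative only.

def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates in one month."""
    return (1.0 - slo) * days * 24 * 60


def should_freeze_features(slo: float, downtime_minutes: float,
                           days: int = 30) -> bool:
    """True when observed downtime has exceeded the month's error budget."""
    return downtime_minutes > monthly_error_budget_minutes(slo, days)


budget = monthly_error_budget_minutes(0.999)
print(f"three nines budget: {budget:.1f} min/month")

# Hypothetical months: a 20-minute blip vs. an outage lasting a full day.
for outage in (20, 24 * 60):
    print(outage, "min down -> freeze features:",
          should_freeze_features(0.999, outage))
```

The point of agreeing on the rule up front is that the freeze decision becomes a comparison of two numbers rather than a negotiation after the fact.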
RobinHood was out for a whole day. This is unacceptable. It points to a complete organizational fuck up - product and feature development have too much power and priority at the expense of reliability.
I’m not sure that RobinHood has ever heard of SLOs or reliability engineering. I really hope their leadership is smart enough to hire and empower the right people that will drive organizational change.
Why would they burden themselves and their feature velocity with SLOs/SLAs when they can build a 5 billion dollar company insanely quickly even though they have downtime?
The users are not saying "We measured your 5 9's and I'm going to quit if you have 6 minutes more downtime"
Sure they lose some users who get annoyed, but they have a 5.6 billion dollar company, some users will go, a lot more are coming
Users are saying “you were down for an entire day and I lost money - I’m out”.
Your reliability target is a product decision. Maybe with the right features the market will tolerate a shitty unreliable financial service that falls over for an entire day. Or maybe RobinHood will go from a 5.6 billion dollar company to a zero dollar company because users hate them.
Point is, high reliability is a choice based on priorities - which it seems RobinHood does not care about. And I will certainly stay the fuck away from their platform.
This works in the acquisition phase, which I suspect Robinhood is nearing the end of.
Once their userbase moves into the retention or conversion phases (competitors have $0 trades now, too), mistakes like this are much more costly in the long term.
You're missing the point. Reliability and Performance are features in financial markets. They are key features for brokerages, which constantly advertise them to differentiate themselves. These companies lay undersea cables to shave a few milliseconds off latency and pay a very hefty premium to be colocated in the same DC/rack as the stock exchange. Therefore Performance and Reliability are inseparable.
Nobody is debating whether people will continue using RH and that was never the issue. RH has massively damaged its reputation and reputation _is_ everything.
> Or the trading platform that's running sub millisecond trades and downtime means 300,000 USD per minute.
I mean, I'll bite. Assuming you only traded 6 hours a day (i.e. US hours), that'd be a 27bn dollar a year strategy, and the only way for returns to be linear and trading to be sub milli is market making/arbitrage.
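The 27bn figure checks out if you treat the quoted downtime cost as the strategy's per-minute take. A quick sketch: the 252 trading days per year is my assumption (a typical US calendar), and the function name is just for illustration.

```python
# Annualize a per-minute figure over a trading calendar.
# Assumption: 252 trading days per year; 6 hours/day is from the comment above.

def annualized_usd(per_minute_usd: int, hours_per_day: int,
                   trading_days: int = 252) -> int:
    """Scale a per-minute dollar figure to a full trading year."""
    per_day = per_minute_usd * 60 * hours_per_day
    return per_day * trading_days


per_year = annualized_usd(300_000, 6)
print(f"${per_year:,} per year")  # $27,216,000,000 per year
```

That is 108 million USD per trading day, which is why the comment reads the claim as implying a market making or arbitrage operation.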
Kudos, these are moderate sized systems you've built over your career. There are a lot bigger and more mission critical systems in the world and you might build them one day.
I understand GP's tone wasn't exactly nice here. But here's the rub with RH's outage. RH is unfortunately in an industry (Finance, Healthcare, Aviation, Food, etc.) where people _need_ to trust them to be successful. The consequences of failure in these industries are catastrophic, not only for them but for their clients. Sure, failures happen, but the scale at which RH has failed and the lukewarm response they've put out have pissed people off. I don't recall any brokerage, old or new, that has failed so catastrophically and has responded to it so poorly. If you think you have a worse example, I am all ears.
I don’t remember them offering any apology or explanation at all.
That’s an exchange mind you where things like the global price of oil and s&p futures trade. Not a small boutique brokerage.
Further they have planned downtime every week & at that point still had planned daily downtime I think.
I think Robinhood screwed up. I think they should learn a hard lesson. But people thinking that trading is some high reliability industry haven’t spent any time in it.
The scary thing to me is: are healthcare, aviation & food the same?
Part of AWS's sell with elasticity is only spending what you need, but those industries have redundancies or unused capacity.
Someone in one of these threads said there's a hidden DNS within VPCs that can fail and isn't scaled, so if that's true, they might just have to architect around that unless they can get AWS to change it. It's on RH for not knowing that but it's also kind of on AWS too.
But as far as what you can do, you can really only split your cash across brokerages if you want to engineer the same redundancy yourself. Otherwise, RH would need to route everything to another exchange to keep satisfying orders, and even that is just another system that could fail. Keeping all of your money in one brokerage doesn't seem ideal if you want to completely avoid downtime. Doing the same redundancy yourself with those industries isn't really practical.
Boeing's failures have killed hundreds of people. Governments still pay them and people still fly on their planes. Stores sell salmonella contaminated products all the time and people still shop there. RH's failure pales in comparison. Crypto exchanges fail all the time, people still use them. RH may lose a few customers in the short term but I see no reason they won't bounce back: they provide a product people like, and the majority of people don't like change and will stay with them provided stability returns soon.
Non technical people don't want a technical apology, they just want an 'our bad, working on it', which is what was provided. The company will be fine. Should they be is another question altogether.
Technically people still fly the old Boeing planes that don't crash. The 737MAX is still not in service, and there is a likelihood that it may never go back into service. All future orders are cancelled, and there isn't a clear pathway to getting the plane re-certified and, more importantly, for people to trust them again.
High trust systems require just that, high trust. And once broken it's hard to re-establish.
Crypto exchanges certainly have their fair share of downtime issues, but don't forget that crypto exchanges for a long time operated purely for early adopters as crypto wasn't something that everyone traded. There was also less competition available, because again the industry was newer and there were fewer choices.
And certainly Coinbase helped to popularize crypto trading and they had their fair share of issues, but I don't believe they had an outage of this exact magnitude, and again they were in an early adopter area where mistakes are seen as part of the process. If not expressly, then at least subconsciously.
I think that we have entered a new 'trust' phase, where we pretty much don't care about it and just want familiarity. Look at Facebook, privacy has been violated a thousand times, and we still keep logging in. Experian is still chugging along. People used and paid for AOL for years when they did not have to.
Online consumption is different than in person. You go to a restaurant and the food is bad you probably don't go back. Online the bulk of consumers just keep going back because that's what they are used to. We love our favorites.
I remember all of AWS going down a couple years ago.
Boeing itself is fine even though one product killed hundreds. Robin Hood is going to be fine. This will be forgotten in a week.
It is not about scale, it is about the fact that people lost real money. If you can’t make it work you should not be in that business, and I don’t really care how hard they work.
I carry reasonable investment balances - I’m not an active trader but in this space I expect availability. I’d never put my money on RH - and this has nothing to do with risk profile.
I've been trading for years, would not keep a penny on that platform. They've effectively cut off all liquidity for their customers for at least 2 days during high market volatility. You are missing out on tax loss harvesting, buying dips etc.
Nothing was stopping the Robinhood customers from opening an eTrade or TD Ameritrade account or something and doing their trading out of that platform for the duration of the outage. Robinhood isn't really an institutional platform in my understanding anyway.
I was a primary contributor on a migration of time series data to Scylla. As an anecdote, I once emailed our business contact about tracking down why we appeared to have data inconsistencies between our new (Scylla backed) and old system. I thought the e-mail got lost since we never heard back...until 8 months later (long after we had de-prioritized the migration since our old system was "good enough"), when they wrote back asking if we had tried the newly released version which fixed a data loss issue.
Blew. My. Mind. Not only because of the radio silence and then dropping back in out of the blue as if no time had passed, but also because they had a data loss issue.
So I re-checked out my previous branch, upgraded Scylla versions and sure enough the data differences we were noticing before appeared to be resolved. I couldn't believe the amount of time I had spent combing through my code to see if I had a hard to detect bug somewhere...but nope, it was ScyllaDB (although I am sure there were plenty of other bugs...just they weren't the cause of this specific symptom).
I am actually a fan of ScyllaDB and what it is trying to do. Performance was great (as advertised) and management was simple enough; but they are going to need to work pretty hard to convince me "instability" is just rumor after that experience not too many years ago.
Well, we moved to it from Cassandra. It's yet to fall over and we're querying it maybe 200k times per second. No changes were needed from the client driver side of things too. YMMV.
I've seen bigger, scarier, potentially costlier time based bugs personally. I don't think this would make me reevaluate my employment if I was at Robinhood. As the parent says, you either learn these lessons the hard way or you haven't learned them yet. That doesn't translate to being a “leadership failure.”
Your smaller point about prioritization is spot on though. I don't believe I've seen any similar incidents lead to business ending outcomes. I personally point to Sony or, more recently, Equifax as examples of the disparity between actual business impact and technical abhorrence. In light of that, why is it worth trying to preemptively solve technical challenges instead of business needs? Every calorie spent on “what if” subtracts from “what's needed.”
Reminds me of the book Showstopper and the personal stories in it - it's about the creation of Windows NT. Pretty interesting how things were not so different some 30 years ago.
Important step though: have a retro, maybe many, and write a report explaining what was messed up and how you might mitigate it in the future. It looks like it’s going to be a good one. If you can share a sanitised version publicly, that would hopefully make it all a little bit more worth it.
I think I speak for everyone here if I say that, if that report is public and interesting, everyone on this thread will be happy to get you a drink.
Robinhood opened up stock trading to a large portion of the population that would otherwise not have been interested in traditional trading platforms with high commissions.
Their success helped to pressure companies such as TD and Schwab to mostly get rid of commissions as well, which is great for the average trader.
I think Robinhood has a lot of problems, but to say they're not pushing any boundaries ignores the huge changes they've brought to the industry.
The fees I am referencing were imposed by brokers, not exchanges. Our exchanges have stock splits, but that still doesn't make a $10 fee on a single $50 share very palatable to the small-time investor.