On the profession side of this, if you're an engineer at RH in the thick of this...

cheschire · on March 4, 2020

If this were an outage directly caused by a natural disaster, I could understand. This outage was an availability problem. This clearly points to some prioritization problems within the leadership layers if robust and resilient infrastructure was not emphasized.

The prioritization problems may not be due to ignorance or malice though, and may be justifiable if there are other fires that are burning brighter. It's still pointing to problems though, and I think it's completely legitimate for engineers to question the stability of the company when this sort of thing happens.

At the very least as an engineer I would be asking some pointed questions of my leadership. Maybe not dusting off the resume yet, but still I'd want to get reassurance from internally that the leadership problems that caused this are being addressed.

malux85 · on March 4, 2020

Sometimes you just have to cut them some slack. Have you engineered a highly available cluster before? I'm not talking about the hot-standby Postgres master that gets called on once every 2 years, but I'm talking about a 180 node Cassandra cluster that's doing 15,000 writes a second 24/7 and peaking at 60,000 writes a second every day, and you have to do node replacements every week or two because of the high load.

Or I'm talking about a 200 node Hadoop cluster that's doing the electrical metering and billing for 8 million people, and is NOT allowed to stop.

Or the trading platform that's running sub millisecond trades and downtime means $300,000 USD per minute.

These are systems I have engineered over the last 10 years, and I can say: These things are complex and have failures in 1000 different ways, and while you're monitoring 999 of them that one thing you're not looking at is festering under the surface (your monitoring system is tracking IRQ hardware interrupt response times, right???)

Part of being in a team is everyone pulling together, and yes it's stressful at the time, but even very good management can't see all ends, just like very good engineering can't predict everything. I don't think it's useful to start pointing the finger at management and "asking some pointed questions at leadership" because sometimes everyone is doing their best. Yes we should analyse our failures so we can do better, but your tone is very accusatory, and I believe that a better approach is an all inclusive chat about how we can do better, and management saying "great job engineering" for fixing it, and giving them a break after the stressful event.

TheCondor · on March 4, 2020

Does the duration of their downtime suggest a “1/1000” unmonitored oversight? Or is it more like a threshold that was met and probably could/should have been observed?

And FWIW, they have down time every day and weekend, at least in a virtual sense; the load does drop off in a very real sense too. You are spiritually correct, they should pull together and sort it out, and they owe nobody money here (don’t use a discount broker if you want some sort of guarantee about trades) but as a general rule you should never feel too sorry for a banker under just about any circumstances. The harshest lesson here, for everybody, was the only thing they would do for you was give you some commission free trades but that won’t work with this one, so a non-apology is what you get.

mumblemumble · on March 4, 2020

I think you may be focusing on the finger instead of the thing that it's pointing at.

The post reads to me like all those examples were meant to be concrete examples to drive home a more general argument that complex systems are, well, complex, and that there's an element of hubris in taking potshots from the peanut gallery.

tmpz22 · on March 4, 2020

I think the original point in this sub-thread boils down to: basic micro-level human error like typos + bad configuration deploys is completely understandable (to a certain extent), but macro level failures that happen by ignoring obvious trends and best practices are malfeasance.

Personally I don't think Robinhood will ever release a full honest post-mortem and so we'll never know (and never be able to judge fairly).

If the system failed by virtue of being too complex, that is also malfeasance because any devops/SRE worth their salt (as might be expected at a 7 BILLION DOLLAR company) should smell unnecessary complexity from a mile away and slowly refactor it away over the course of several years - which, looking at Robinhood's downtime history, they never did.

The closest example to Robinhood's engineering woes is Reddit, which throughout its early history made fairly poor infrastructure and data modeling decisions but has since repaired and improved on them. We should hold Robinhood to higher expectations than Reddit for obvious reasons. Them having similar engineering capability to circa 2012 start-up Reddit is INEXCUSABLE.

eyegor · on March 4, 2020

As with any big system, spinning it up is much harder than bringing it down. After an outage, they have to stay offline to audit their systems to ensure that all the nodes are synchronized, all queued trades have been processed, and no accounts are in invalid states. I'm sure they could have restarted in a matter of minutes, but the risk is ridiculously high.

raiyu · on March 4, 2020

No doubt there are many complex systems and they inevitably go down. Every provider has suffered meaningful outages.

I think the issue here isn’t so much that the system went down but the blog post.

It’s very light on details and doesn’t go far enough in terms of re-establishing trust with the customers that were affected. Which by the looks of it is everyone attempting any trade most of the day on Monday.

luckylion · on March 4, 2020

On the other hand, they've had plenty of time and resources to do just that in a reliable fashion, it's not like it's one guy in his bedroom (I hope!). It's not like they are volunteers doing this open source for the community, they are getting paid (very well, I assume) to run the system. And Management is getting paid (even better, I assume) to make sure the priorities are right and correct decisions are taken. "Who could've known there might be a lot more traffic" sounds like somebody failed in Management, and engineering might have failed by not foreseeing the issue and/or informing Management.

Sure, don't burn people at the stake, but "hey, it's hard, don't blame them, they are doing their best" doesn't cut it for me. I'm sure they're expecting to be paid and not for someone to "do their best" to pay them.

malux85 · on March 4, 2020

Can you give me a concrete example of a massive distributed system that has zero downtime?

Because the largest distributed system I have seen and worked on was at Apple (or maybe DFP at Google) - and even though they had some of the smartest people in the world and literally billions of dollars behind them, there was still an endless list of problems and downtime events.

Spoiler alert: It doesn't exist.

luckylion · on March 4, 2020

The point isn't that "a system cannot fail", the point is "if the system fails, it's no big deal, shit happens, cut them some slack" is a weird way to look at it for corporate systems, especially in sensitive areas.

If you're running a HA system and you only need one nine to express your availability percentage, sure, sure, you have the smartest people etc and you're doing such a great job, and yeah, yeah, show me one system that has 100% uptime etc.

malux85 · on March 4, 2020

I didn't say it's no big deal, you're extrapolating and exaggerating my words because your argument is weak.

My point was that failure is inevitable in any complex system, and I was responding to the parent's point that he immediately pointed the finger at management in an accusatory way, and I was saying that's not constructive.

Also your point "They expect to be paid" is actually implicitly "I expect management to do their best to pay me" - there could be a failure in the payroll system, there could be a failure in the banks, there could be many reasons outside management's control that mean I'm not getting paid. I can say "why don't you have redundant payroll systems" (which is a stupid waste of resources given the cost/benefit/low failure rate). But my point is again - complex systems have failures - and SOMETIMES, JUST SOMETIMES, YOU CAN CUT THEM SOME SLACK.

0x8BADF00D · on March 4, 2020

When a fiduciary breaks their duty to their clients, you don’t cut them slack. You sue them. This isn’t like Silicon Valley where you can get away with antics like this.

malux85 · on March 4, 2020

You must be new here, welcome to late stage capitalism. Nobody rich goes to jail, and lawsuits are cost of business. You just factor them into the 5 billion dollar company, pay your 300M dollar fine and walk away a billionaire.

It sucks, I'm not defending it, but it's fact

unicornmama · on March 4, 2020

Google doesn’t target zero downtime. The marginal cost is too high. For important services (like Search page and ads) they aim for 5 nines uptime (99.999%), which translates to 5 minutes of downtime per year.

https://en.m.wikipedia.org/wiki/High_availability
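
For anyone who wants to check the "nines" arithmetic above, here is a minimal Python sketch. The function name `downtime_minutes` and the 365-day year are assumptions made for illustration; nothing here comes from Google's own tooling.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes(availability_pct: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime allowed (in minutes) at a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes(pct):.1f} minutes per year")
```

Under those assumptions five nines works out to a little over 5 minutes a year, which matches the figure quoted above.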

C1sc0cat · on March 4, 2020

As an ex telco guy all I can say is "amateurs" :-)

czbond · on March 4, 2020

And not a mention of Erlang at all? ;)

C1sc0cat · on March 4, 2020

I wasn't in Traffic

malux85 · on March 4, 2020

I know, I worked there for 3 years on million node clusters

unicornmama · on March 4, 2020

Then certainly you understand the importance of SLOs, how SLAs regulate reliability and feature velocity.

Let’s say I’m RobinHood. Let’s pick an SLO. I think three nines monthly SLO is a good start, that budgets ~45 minutes of down time per month. Maybe I can argue for a more aggressive SLO, but let’s pick this one - because I think it will keep users relatively happy as trades aren’t blocked for more than an hour at worst. I drive an agreement with stakeholders that if we needle out of this SLO, we drop all feature work and focus on hardening reliability.

RobinHood was out for a whole day. This is unacceptable. It points to a complete organizational fuck up - product and feature development have too much power and priority at the expense of reliability.

I’m not sure that RobinHood has ever heard of SLOs or reliability engineering. I really hope their leadership is smart enough to hire and empower the right people that will drive organizational change.
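
To put rough numbers on the SLO argument above, here is a small Python sketch of an error budget. It rests on assumptions that are not in the comment: a 30-day month measured on wall-clock time, and "a whole day" read as one regular US trading session of 6.5 hours. The names `error_budget_minutes` and `budget_periods_consumed` are hypothetical.

```python
def error_budget_minutes(slo_pct: float, days_in_period: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the period."""
    return days_in_period * 24 * 60 * (1 - slo_pct / 100)


def budget_periods_consumed(outage_minutes: float, slo_pct: float,
                            days_in_period: int = 30) -> float:
    """How many periods' worth of error budget a single outage uses up."""
    return outage_minutes / error_budget_minutes(slo_pct, days_in_period)


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)  # three nines, 30-day month
    outage = 6.5 * 60                    # assumed: one full trading session
    used = budget_periods_consumed(outage, 99.9)
    print(f"monthly budget: {budget:.1f} min")
    print(f"outage: {outage:.0f} min, about {used:.1f} months of budget")
```

With these assumptions the monthly budget is about 43 minutes (close to the ~45 quoted), and a single full-session outage burns roughly nine months of it. Measuring the SLO against market hours only would make the budget smaller still.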

malux85 · on March 4, 2020

Why would they burden themselves and their feature velocity with SLOs/SLAs when they can build a 5 billion dollar company insanely quickly even though they have downtime?

The users are not saying "We measured your 5 9's and I'm going to quit if you have 6 minutes more downtime"

Sure they lose some users who get annoyed, but they have a 5.6 billion dollar company, some users will go, a lot more are coming

unicornmama · on March 4, 2020

Users are saying “you were down for an entire day and I lost money - I’m out”.

Your reliability target is a product decision. Maybe with the right features the market will tolerate shitty unreliable financial services that falls over for an entire day. Or maybe RobinHood will go from a 5.6 billion dollar company to a zero dollar company because users hate them.

Point is, high reliability is a choice based on priorities - one that RobinHood does not seem to care about. And I will certainly stay the fuck away from their platform.

ethbro · on March 4, 2020

> some users will go, a lot more are coming

This works in the acquisition phase, which I suspect Robinhood is nearing the end of.

Once their userbase turns into the retention or conversion (competitors have $0 trades now, too) phases, mistakes like this are much more costly in the long term.

techie128 · on March 5, 2020

You're missing the point. Reliability and Performance are features in Financial markets. It is a key feature for brokerages which they constantly advertise to differentiate themselves. These companies lay undersea cables to shave off a few milliseconds of latency and pay a very hefty premium to be colocated in the same DC/rack as the stock exchange. Therefore Performance and Reliability are inseparable.

Nobody is debating whether people will continue using RH and that was never the issue. RH has massively damaged its reputation and reputation _is_ everything.

C1sc0cat · on March 4, 2020

Dialcom (Telecom Gold) in the UK was pretty close to 100%. Almost survived the big storm of 87 - unfortunately the modems were on the UPS.

We built an entire new DC and had Tottenham Court Road dug up in case the Thames flooded.

In fact any big telecom will have down times for a switch (central office) measured in generations

SirLJ · on March 4, 2020

Those Silicon Valley kids can't understand reliability... Good thing my bank and my broker are not run like this... what a joke...

techie128 · on March 5, 2020

You sir have a very warped view of SV. Do not stereotype us.

SirLJ · on March 9, 2020

First hand experience... move fast and break things, forever in beta, etc. does not always work, especially when you have proper SLAs

frockington1 · on March 4, 2020

Can you name a reputable brokerage that was down all of Monday and Tuesday this week?

Spoiler alert: it doesn't exist

plusplusc · on March 4, 2020

Interactive Brokers.

Well known in the financial community, but nearly unknown outside of it.

SirLJ · on March 4, 2020

Not true: Interactive Brokers were up as always and the API was working without a flaw...

Ntrails · on March 4, 2020

> Or the trading platform thats running sub millisecond trades and downtime means 300,000 USD per minute.

I mean, I'll bite. Assuming you only traded 6 hours a day (ie US) that'd be a 27bn dollar a year strategy, and the only way for returns to be linear and trading to be sub milli is market making/arbitrage.

That is a lot of half spreads...
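
The back-of-the-envelope figure above can be reproduced with a short Python sketch. The $300,000 per minute and the 6-hour day come from the thread; the 252 trading days per year is an assumption (a typical US count), and `implied_annual_usd` is a made-up name for illustration.

```python
def implied_annual_usd(per_minute_usd: int, hours_per_day: int,
                       trading_days: int) -> int:
    """Annual dollar figure implied by a constant per-minute rate."""
    per_day = per_minute_usd * 60 * hours_per_day
    return per_day * trading_days


if __name__ == "__main__":
    # 252 trading days is an assumption, not a number from the thread.
    total = implied_annual_usd(300_000, 6, 252)
    print(f"${total / 1e9:.1f}bn per year")
```

That lands at roughly 27bn dollars a year, in line with the estimate above, provided the per-minute figure really is constant across the trading day.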

techie128 · on March 4, 2020

Kudos, these are moderate sized systems you've built over your career. There are a lot bigger and more mission critical systems in the world and you might build them one day.

I understand GP's tone wasn't exactly nice here. But here's the rub with RH's outage. RH is unfortunately in an industry (Finance, Healthcare, Aviation, Food, etc.) where people _need_ to trust them to be successful. The consequences of failure in these industries are catastrophic not only for them but for their clients. Sure failures happen but the scale at which RH has failed and the lukewarm response they've put out has pissed off people. I don't recall any brokerage, old or new, that has failed so catastrophically and has responded to it so poorly. If you think you have a worse example, I am all ears.

kasey_junk · on March 4, 2020

Hard to judge _worse_ in this context but while I was trading all of CME globex was down for 4 hours canceling active orders.

https://www.profit-loss.com/cme-hit-by-globex-outage/

I don’t remember them offering any apology or explanation at all.

That’s an exchange mind you where things like the global price of oil and s&p futures trade. Not a small boutique brokerage.

Further they have planned downtime every week & at that point still had planned daily downtime I think.

I think Robinhood screwed up. I think they should learn a hard lesson. But people thinking that trading is some high reliability industry haven’t spent any time in it.

The scary thing to me is are healthcare, aviation & food the same?

vsareto · on March 4, 2020

Part of AWS's sell with elasticity is only spending what you need, but those industries have redundancies or unused capacity.

Someone in one of these threads said there's a hidden DNS within VPCs that can fail and isn't scaled, so if that's true, they might just have to architect around that unless they can get AWS to change it. It's on RH for not knowing that but it's also kind of on AWS too.

But as far as what you can do, you can really only split your cash across brokerages if you want to engineer the same redundancy yourself. Otherwise, RH would need to route everything to another exchange to keep satisfying orders, and even that is just another system that could fail. Keeping all of your money in one brokerage doesn't seem ideal if you want to completely avoid downtime. Doing the same redundancy yourself with those industries isn't really practical.

wonderwonder · on March 4, 2020

Boeing's failures have killed hundreds of people. Governments still pay them and people still fly on their planes. Stores sell salmonella contaminated products all the time and people still shop there. RH's failure pales in comparison. Crypto exchanges fail all the time, people still use them. RH may lose a few customers in the short term but I see no reason they won't bounce back, they provide a product people like and the majority of people don't like change and will stay with them provided stability returns soon.

Non technical people don't want a technical apology, they just want an 'our bad, working on it' which is what was provided. The company will be fine. Should they be is another question altogether.

raiyu · on March 4, 2020

Technically people still fly the old Boeing planes that don't crash. The 737MAX is still not in service, and there is a likelihood that it may never go back into service. All future orders are cancelled, and there isn't a clear pathway to getting the plane re-certified and, more importantly, for people to trust them again.

High trust systems require just that, high trust. And once broken it's hard to re-establish.

Crypto exchanges certainly have their fair share of downtime issues, but don't forget that crypto exchanges for a long time operated purely for early adopters as crypto wasn't something that everyone traded. There was also less availability of competition, because again the industry was newer and there were fewer choices.

And certainly Coinbase helped to popularize crypto trading and they had their fair share of issues, but I don't believe they had an outage of this exact magnitude, and again they were in an early adopter area where mistakes are seen as part of the process. If not expressly, then at least subconsciously.

wonderwonder · on March 4, 2020

I think that we have entered a new 'trust' phase, where we pretty much don't care about it and just want familiarity. Look at Facebook, privacy has been violated a thousand times, and we still keep logging in. Experian is still chugging along. People used and paid for AOL for years when they did not have to.

Online consumption is different than in person. You go to a restaurant and the food is bad you probably don't go back. Online the bulk of consumers just keep going back because that's what they are used to. We love our favorites.

I remember all of AWS going down a couple years ago.

Boeing itself is fine even though one product killed hundreds. Robin Hood is going to be fine. This will be forgotten in a week.

C1sc0cat · on March 4, 2020

The problems with the TSB in the UK come to mind and NatWest/RBS had a similar SNAFU a few years back.

tinus_hn · on March 6, 2020

Catastrophe would be if they actually lost money. This is all indirect damage that is probably disclaimed in their terms of service.

No service guarantees 100% availability, it doesn’t exist.

LaserToy · on March 4, 2020

It is not about scale, it is about the fact that people lost real money. If you can’t make it work you should not be in that business, and I don’t really care how hard they work.

I’m taking my account off their platform.

malux85 · on March 4, 2020

Is this your first day of trading or something?

People lose money in trading all the time, for hundreds of reasons and some of those reasons are infrastructure downtime.

If your risk profile doesn't reflect that, maybe you should take your money out of trading altogether.

anon102010 · on March 4, 2020

I carry reasonable investment balances - I’m not an active trader but in this space I expect availability. I’d never put my money on RH - and this has nothing to do w risk profile

frockington1 · on March 4, 2020

I've been trading for years, would not keep a penny on that platform. They've effectively cut off all liquidity for their customers for at least 2 days during high market volatility. You are missing out on tax loss harvesting, buying dips etc.

LaserToy · on March 4, 2020

I will take it off this platform for sure.

LaserToy · on March 4, 2020

And why -2 points??? Already moving money out.

What is annoying - only 50k per day is allowed.

blackearl · on March 4, 2020

Because complaining about RH being a low quality trading platform is like being angry at the burger quality at McDonald's. You get what you pay for.

EpicEng · on March 4, 2020

Nearly every brokerage now has zero fee trades, just like RH. They weren't down Monday.

argonaut · on March 4, 2020

Nearly every online brokerage has had an outage or outages in the past.

LaserToy · on March 4, 2020

For a full day + morning and issued non apology? Also, I prefer to be with the group that is not in the “nearly all” one

dirtydroog · on March 4, 2020

Exactly. When markets are volatile I imagine they find it difficult to manage risk and so just shut everyone out and blame it on IT.

simonh · on March 4, 2020

That would be extremely illegal, at a company-killing, send-them-to-jail level, and would be impossible to hide.

yuppie_scum · on March 4, 2020

Nothing was stopping the Robinhood customers from opening an eTrade or TD Ameritrade account or something and doing their trading out of that platform for the duration of the outage. Robinhood isn't really an institutional platform in my understanding anyway.

LaserToy · on March 4, 2020

How can you do it if your equity is in Robinhood?

dirtydroog · on March 4, 2020

If you used Scylla you'd have only needed 90 nodes. (Don't believe the instability rumours)

gshulegaard · on March 4, 2020

I was a primary contributor on a migration of time series data to Scylla. As an anecdote, I once emailed our business contact about tracking down why we appeared to have data inconsistencies between our new (Scylla backed) and old system. I thought the e-mail got lost since we never heard back...until 8 months later (long after we had de-prioritized the migration since our old system was "good enough"), when they replied asking if we had tried the newly released version which fixed a data loss issue.

Blew. My. Mind. Not only because of the radio silence and then dropping back in out of the blue as if no time had passed, but also because they had a data loss issue.

So I re-checked out my previous branch, upgraded Scylla versions and sure enough the data differences we were noticing before appeared to be resolved. I couldn't believe the amount of time I had spent combing through my code to see if I had a hard to detect bug somewhere...but nope, it was ScyllaDB (although I am sure there were plenty of other bugs...just they weren't the cause of this specific symptom).

I am actually a fan of ScyllaDB and what it is trying to do. Performance was great (as advertised) and management was simple enough; but they are going to need to work pretty hard to convince me "instability" is just rumor after that experience not too many years ago.

dirtydroog · on March 4, 2020

Well, we moved to it from Cassandra. It's yet to fall over and we're querying it maybe 200k times per second. No changes were needed from the client driver side of things too. YMMV.

ethbro · on March 4, 2020

And this is why folks still run DB2.

donavanm · on March 4, 2020

I've seen bigger, scarier, potentially costlier time based bugs personally. I don't think this would make me reevaluate my employment if I was at Robinhood. As the parent says you either learn these lessons the hard way or you haven't learned them yet. That doesn't translate to being a “leadership failure.”

Your smaller point about prioritization is spot on though. I don't believe I've seen any similar incidents lead to business ending outcomes. I personally point to Sony or, more recently, Equifax as examples of the disparity between actual business impact and technical abhorrence. In light of that why is it worth trying to preemptively solve technical challenges instead of business needs? Every calorie spent on “what if” subtracts from “what's needed.”

kerng · on March 4, 2020

Reminds me of the book Showstopper and the personal stories in it - it's about the creation of Windows NT. Pretty interesting how things were not so different some 30 years ago.

In case anyone is interested: https://www.amazon.com/Show-Stopper-Breakneck-Generation-Mic...

dmix · on March 10, 2020

Interesting, so it took 5yrs, $150-million, and 250-employees to get NT shipped. Adding this one to my reading list!

bertil · on March 4, 2020

Important step though: have a retro, many maybe and write a report explaining what was messed up and how you might mitigate in the future. It looks like it’s going to be a good one. If you can share a sanitised version publicly, that would hopefully make it all a little bit more worth it.

I think I speak for everyone here if I say that, if that report is public and interesting, everyone on this thread will be happy to get you a drink.

vinaypai · on March 4, 2020

This is all true for a company that is actually pushing any boundaries as opposed to failing pathetically at a well solved problem.

indecisive_user · on March 4, 2020

Robinhood opened up stock trading to a large portion of the population that would otherwise not have been interested in traditional trading platforms with high commissions.

Their success helped to pressure companies such as TD and Schwab to mostly get rid of commissions as well, which is great for the average trader

I think Robinhood has a lot of problems, but to say they're not pushing any boundaries ignores the huge changes they've brought to the industry.

C1sc0cat · on March 4, 2020

This is the 21st century low cost trading has been around for several decades now

kube-system · on March 4, 2020

Commission-free stock trading hasn’t, though. If you’re only trading a couple of shares at a time, $5-10 for a trade is a pretty steep fee.

C1sc0cat · on March 4, 2020

Fix your stock exchanges to have stock splits then like the LSE does.

kube-system · on March 4, 2020

The fees I am referencing were imposed by brokers, not exchanges. Our exchanges have stock splits, but that still doesn't make a $10 fee on a single $50 share very palatable to the small-time investor.
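
As a quick illustration of why that fee stings, here is a minimal Python sketch. The $10 fee and $50 share price are the figures from the comment above; the round-trip case (a fee on both the buy and the sell) is an added assumption, and both function names are hypothetical.

```python
def fee_pct_of_position(fee_usd: float, position_usd: float) -> float:
    """Flat commission expressed as a percentage of the amount invested."""
    return 100 * fee_usd / position_usd


def breakeven_gain_pct(fee_usd: float, position_usd: float) -> float:
    """Gain needed to cover the commission on both the buy and the sell.

    Assumption: the same flat fee is charged on each leg of the trade.
    """
    return 100 * (2 * fee_usd) / position_usd


if __name__ == "__main__":
    print(f"one-way fee: {fee_pct_of_position(10, 50):.0f}% of the position")
    print(f"round trip: share must rise {breakeven_gain_pct(10, 50):.0f}% to break even")
```

On a single $50 share the one-way fee is 20% of the position, and under the round-trip assumption the share would need to gain 40% before the trade breaks even.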

C1sc0cat · on March 5, 2020

A $50 share should be split, which was my point, and tbh if you're only investing $50 you should not be investing in individual shares.

kube-system · on March 5, 2020

Whether or not you think that market for trading should exist, it does.

unicornmama · on March 4, 2020

Pushing the boundaries? They wrote an app that gamifies stock and options trading...