Avoiding Microservice Megadisasters
by Jimmy Bogard
1. Nine and a half minutes
After converting the system to a microservice architecture, we had a pre-production go-live.
After 18 months with hundreds of developers on the task, they flipped the switch, went to the website, and nothing showed up.
That was rather embarrassing for quite a lot of people. So I said: OK, that's odd. We expected to get some response, because we had been running this application on our laptops for so long. Why is it failing now?
Then we updated all the timeout settings to the maximum, an infinite timeout. After waiting nine and a half minutes, the website finally loaded. This was just a simple page showing a list of products for one of the catalogs.
What happened was that a single API request was calling many other services and completely saturating their network with all the API calls going on.
2. No stack trace
With a normal application like a monolith, if something goes wrong, such as an exception or a crash, it's really easy to figure out what went wrong because you have a stack trace.
A single stack trace tells you everything that's executing at once.
With this kind of architecture, where everything is an API call, there's no one stack trace. Instead I have to go through tools like Wireshark or other network monitoring tools to figure out what is going on.
3. 99.99% Uptime x 200 Requests = 0
We looked to see if they had any resiliency built in, things like the circuit breaker pattern. Of course not. If any one of those API calls failed, the entire request failed.
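As a rough illustration of the circuit breaker pattern mentioned above, here is a minimal Python sketch. It is not what the team built; the class, the thresholds, and the `fetch_price_with_fallback` helper are all hypothetical names chosen for this example. The idea is that after a few consecutive failures the breaker stops calling the downstream service for a cool-down period, and the caller degrades gracefully instead of failing the whole request.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_price_with_fallback(breaker, fetch, sku, default=None):
    """Hypothetical caller: return a default instead of failing the page."""
    try:
        return breaker.call(fetch, sku)
    except Exception:
        return default
```

With something like this in place, one unavailable downstream service costs the page a missing field rather than the whole response.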
If you take 99.9 percent uptime times 200 calls and run it through the probabilities class I took in college, we find that the site has a zero percent chance of ever being up, so this architecture was doomed.
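The arithmetic behind that remark can be sketched in a few lines of Python. This assumes every call must succeed and that failures are independent, which is a simplification; the function name is made up for this example. Under those assumptions the chance that the whole request succeeds is the per-call uptime raised to the number of calls.

```python
def compound_availability(per_call_uptime, calls):
    """Probability that all `calls` independent calls succeed."""
    return per_call_uptime ** calls


for uptime in (0.9999, 0.999, 0.99):
    result = compound_availability(uptime, 200)
    print(f"{uptime:.2%} per call x 200 calls -> {result:.1%}")
```

Taken literally, the result is not zero: it is roughly 98% at 99.99% per call, about 82% at 99.9%, and about 13% at 99%. The "zero" reads as rhetorical, but the point stands that availability drops quickly as the number of required calls grows or as per-call uptime slips.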
4. What is a service
The original service-oriented architecture definition of a service is software that:
- is owned, built, and run by our organization
- is responsible for holding, processing, and distributing particular kinds of information within the scope of a system
- can be built, deployed, and run independently
Another definition says a service:
- communicates with other consumers, presenting information using conventions or contract assurances
- protects itself against unwanted access and information loss
- handles failure conditions so that they cannot lead to information corruption
If you sum all these pieces up in one word, it's autonomy.
A microservice is small, focused on doing one thing well, and autonomous.
5. Bounded Context
Domain-Driven Design (DDD)
If you're going to read the DDD book, start at the part that talks about bounded contexts and kind of ignore the rest, because all that stuff about entities and aggregates is really just not that important.
Domain-driven design says I have a particular model, and that model may not make sense in different contexts. A bounded context says: this is the boundary in which this particular model can be applied. Inside that context we have a logical, unified model.
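A small Python sketch can make the idea concrete. The two contexts and every field name here are hypothetical, chosen only for illustration: the same SKU is modeled differently in a catalog context and in a shipping context, and each model is only meaningful inside its own boundary.

```python
from dataclasses import dataclass


# Hypothetical catalog context: a product is something to describe and display.
@dataclass
class CatalogProduct:
    sku: str
    title: str
    description: str


# Hypothetical shipping context: the same SKU is just a parcel to move.
@dataclass
class ShippingProduct:
    sku: str
    weight_kg: float
    fragile: bool


def shipping_label(item: ShippingProduct) -> str:
    """Only makes sense inside the shipping context."""
    handling = "FRAGILE" if item.fragile else "STANDARD"
    return f"{item.sku} {item.weight_kg:.1f}kg {handling}"
```

Neither class is "the" product. The only thing they share is the identifier, which is what lets the contexts talk to each other.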
6. Duplicate Data
We can't have a system in which we don't duplicate data. In fact, if you look at the real world, it duplicates data constantly.
If you're just looking at a website, that is duplicating some data from the server and showing it in the browser via HTML. So immediately I have stale, duplicated data, yet we're able to function just fine as a society knowing that the information showing on my screen may be a little bit old.
Duplication of data is good. One way to go about that is to recognize that we live in a world of data duplication. In fact, their original business was built on data duplication: it used to be a mail-order catalog business, and that's millions of copies of the exact same duplicated data pushed out to hundreds of millions of homes all over the world. They had already duplicated data and built a business around it, so why is it any different now?
7. Service Dependency Inversion
In our original picture we had one service depending on a lot of other ones. Depending on those other ones is not wrong. What is wrong is that my uptime is completely dependent on the uptime of all those other services.
We wanted to flip these individual arrows around. So instead of a single request from the outside world calling a myriad of internal services, we flipped all those arrows so that each of those other boxes would be pushing information somehow into our search service.
That way, when an external request came in, I only ever had to go to a local database that was specifically built for the purpose of search.
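The flipped arrows can be sketched in Python as follows. This is a simplified illustration, not the team's implementation: the `SearchService` class, its method names, and the in-memory dictionary standing in for the local search database are all assumptions made for this example. Upstream services push changes in, and the query path never calls out.

```python
class SearchService:
    """Answers queries from its own local store only."""

    def __init__(self):
        self._documents = {}  # sku -> merged search document

    def on_update(self, sku, fields):
        """Called when an upstream service pushes a change to us."""
        doc = self._documents.setdefault(sku, {"sku": sku})
        doc.update(fields)

    def search(self, term):
        """No outbound calls here, so upstream outages cannot fail a query."""
        term = term.lower()
        return [
            doc for doc in self._documents.values()
            if term in str(doc.get("title", "")).lower()
        ]


search = SearchService()
search.on_update("SKU-1", {"title": "Red Kettle"})  # pushed by a catalog service
search.on_update("SKU-1", {"price": 19.99})         # pushed by a pricing service
results = search.search("kettle")
```

The trade-off is that the local copy may be slightly stale, which is the same kind of duplication discussed in the previous section.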
7.1. Localization
Localization was a bit tricky because, even though the information rarely changed, there was a lot of it.
Once a day they would put out a file on a shared drive. It was just a flat file dump of their database that said: here are all the SKUs and all the translations of every single product detail. So we said, right, we'll use that. We'll go get that file once a day, do a kind of bulk insert, shove that into our document database, and call it a day.
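A daily bulk load of that kind might look roughly like the Python sketch below. The file layout is an assumption: the source only says it was a flat file dump, so the tab-separated format and the `sku`, `locale`, and `name` columns are invented for illustration, and a plain dictionary stands in for the document database.

```python
import csv
import io


def load_translations(flat_file, store):
    """Bulk-load a daily dump into a local document store (here, a dict)."""
    reader = csv.DictReader(flat_file, delimiter="\t")
    count = 0
    for row in reader:
        # One document per SKU, holding every translation for that SKU.
        doc = store.setdefault(row["sku"], {"sku": row["sku"], "translations": {}})
        doc["translations"][row["locale"]] = row["name"]
        count += 1
    return count


# Hypothetical sample standing in for the file on the shared drive.
sample = io.StringIO(
    "sku\tlocale\tname\n"
    "SKU-1\ten\tRed Kettle\n"
    "SKU-1\tde\tRoter Kessel\n"
)
store = {}
rows_loaded = load_translations(sample, store)
```

Because the load overwrites documents by SKU, rerunning it with the next day's file is safe.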
7.2. Content
Content changed fairly frequently and those changes actually needed to make it out to production pretty quickly.
We had some triggers put inside the database for us, so when an individual table changed, a message would be generated to feed into our search service. We would then take that message and update our local document database.
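The trigger approach can be approximated with Python's built-in `sqlite3` module. This is only a sketch under assumptions: the source does not name the database or the messaging mechanism, so the `content` and `content_changes` tables, the trigger names, and the `drain_changes` function are hypothetical, and a change table that gets polled stands in for a real message feed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE content (sku TEXT PRIMARY KEY, body TEXT);
-- Hypothetical change table: one row per change message.
CREATE TABLE content_changes (
    id INTEGER PRIMARY KEY AUTOINCREMENT, sku TEXT, body TEXT
);
CREATE TRIGGER content_inserted AFTER INSERT ON content
BEGIN
    INSERT INTO content_changes (sku, body) VALUES (NEW.sku, NEW.body);
END;
CREATE TRIGGER content_updated AFTER UPDATE ON content
BEGIN
    INSERT INTO content_changes (sku, body) VALUES (NEW.sku, NEW.body);
END;
""")


def drain_changes(conn, search_documents):
    """Apply pending change messages to the local search store, then clear them."""
    rows = conn.execute(
        "SELECT id, sku, body FROM content_changes ORDER BY id"
    ).fetchall()
    for _id, sku, body in rows:
        search_documents.setdefault(sku, {"sku": sku})["content"] = body
    conn.execute("DELETE FROM content_changes")
    conn.commit()
    return len(rows)


search_documents = {}
db.execute("INSERT INTO content VALUES ('SKU-1', 'Old copy')")
db.execute("UPDATE content SET body = 'New copy' WHERE sku = 'SKU-1'")
applied = drain_changes(db, search_documents)
```

Applying the messages in order means the search store ends up with the latest content even when several changes arrive between drains.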
8. Jimmy's Law
After we fixed the problem, I went back later and talked to some of the teams. They said everything was great, but the problem was that they still had the same HR structure they used to have before.
So even though each of these teams was able to build their own application independently, they were still run by crap managers. And because the only way you could get promoted in this company was, again, to manage more and more people, the product teams had to keep growing and the products themselves had to keep growing, because that fit the motivations of the individual managers.
This was because of the external motivations placed on each of these different managers. There are not a lot of different ways you can fix this.
Conway's Law: any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure.
I see this absolutely all the time: the systems we build more or less match the human systems that exist in the company. I don't think it's a bad thing, but you have to make sure that the motivations of the teams allow this to be set up for success.
I have a corollary to this, Jimmy's Law, which is:
A broken, dysfunctional organization driven by meeting unhealthy goals and metrics will produce broken and dysfunctional systems.
9. Inverse Conway Maneuver
So if you have this crappy organization and you want to build systems for it, you're going to build crappy systems, because Conway's Law is the law.
What you need to do is perform what's known as the inverse Conway maneuver, which is to design the organization you want, and then the architecture will follow, kicking and screaming.
Now this is daunting, right? It basically says that in order to build good systems you may have to reorganize your company. And who has done that before?
Just a company of one. Does that count?