It’s common for us to reach for analogies to physical structures, like cars and bridges, when talking about software development. We need something tangible to think about. We even talk about “software architecture”, a term directly borrowed from, well, architecture. The problem is, these analogies are usually very shallow, which results in simplifications and incorrect conclusions. So let’s think about it a bit deeper.
If you haven’t read my post on feedback and PDSA, I encourage you to start there before diving into this one.
For instance, the simplified idea of designing and building a bridge is that one group of people draws the bridge on paper, then another group lays bricks or welds stuff together and, lo and behold, a bridge.
In that parallel, Invision mockups, Word documents, and UML diagrams are the equivalent of a blueprint (or technical drawing) for software, while the resulting code is the bridge — the product. At first that sounds reasonable, until you realize what a technical drawing for a bridge actually is.
A technical drawing is, well, technical. It isn’t a napkin drawing. It’s a detailed, formal specification created in a technical language for a technical audience. It’s covered by ISO standards which ensure that the specification is unambiguous. Does that remind you of anything? A detailed, unambiguous specification in a formal, technical language?
The code is not the bridge. The code is a blueprint and programming, on the other hand, is the process of creating the blueprint. In other words, programming is designing. This realization may seem major or minor, depending on where you stand, but it has profound consequences and has fundamentally changed my understanding of our job as programmers.
Let’s spend a bit more time talking about design in general, especially from the perspective of physical structures. The blueprint is a result of months or years of design, often involving miniatures in wind tunnels, computer simulations, stress testing of construction elements, and even VR/AR renderings. In other words, for engineers working on physical stuff, it’s only natural to always attempt to get as close to the real thing as possible, as quickly as possible, given the resources and tools at your disposal. And then feed that back into the blueprint. Sure, they can't have a bridge pop into existence, but they sure as hell can simulate it.
But don’t take my word for it. Gordon Murray, the designer behind the legendary McLaren F1 supercar, has recently revealed his newest creation, the T.50. In one of the numerous interviews he was asked how the design process has changed since the 1990s. His answer? The biggest change is the tools, which allow you to "see the car much earlier". For instance, the F1’s body was signed off in clay, which gives you the overall shape of the car, but without windows and shiny surfaces it’s hard to notice highlights. When he first saw the finished car in full glory, he noticed the spine on the top was 50 millimetres too wide. But it would’ve cost too much to alter the design at that point, and it’s been bugging him ever since. The T.50’s proportions are perfect, though, because the tools, such as photorealistic 3D rendering and VR, allowed him to see the car very early on, as if it was really there. And it shows.
The point is, if he could 3D print the whole car out of final materials in a matter of minutes, making the production process instantaneous, he would be doing that (and driving it) multiple times a day.
So what is software production? Well, to answer this question, and find the right analogy, we’ll need to define the product and work our way backwards. A physical product is the stuff which exists in the real world in its final form and scale — a bridge you can walk on, a car you can drive. Stuff you can actually interact with.
In the case of software, the product is also the thing you can interact with — buttons you can press, forms you can fill. It’s literally the stuff on your screen. The computer uses the code to produce the actual software you’re interested in, just like workers use blueprints and CAD files to manufacture and assemble structures.
Notice one thing, though: This means the software production process is instantaneous and automatic — all it requires is hardware and electricity. Which in turn means that we have one tool at our disposal, which no physical designer's ever had — the ability to immediately turn our designs into reality, as if 3D printing a functioning car within minutes. But because we tend to think of programming as production, happening after the design is finalized, we don’t use it to its full potential.
I think this is simply a remnant of the old days of selling and distributing software on physical media. Since we were actually selling compiled code on disks, we started thinking of creating that disk as the end goal, and of what we’re selling as the product. This linguistic quirk still influences our thinking to this day, even though the whole economy around it works very differently now.
In reality, if we wanted to find the best modern parallel between software and physical structures, an executable is akin to a CAD file you buy online and feed into a 3D printer. While it is a product in economic terms (something you can obtain in a financial transaction), in reality it's a piece of design. The final product is built by the 3D printer, which stands for a computer in this analogy.
I believe making this understanding second nature is key to tapping into the full potential of PDSA, feedback loops, and iterative development in software. Not understanding it, on the other hand, leads to wasted opportunities.
Feedback to cost ratio
The biggest waste in the typical “design before code”, rather than “code as design”, process is that we create too many ambiguous, informal specifications, fooling ourselves into thinking they’re equivalent to blueprints. Then, whenever an inevitable misunderstanding or a need to pivot occurs, we try to remedy it by doing more of the same, instead of rethinking our process.
The elephant in the room is that programming takes time. But let me remind you of the most important quote I’ve ever read:
“The design process is an iterative one. I will tell you one thing which can go wrong with it if you are not in the laboratory. In my terms design consists of:
1. Flowchart until you think you understand the problem.
2. Write code until you realize that you don’t.
3. Go back and re-do the flowchart.
4. Write some more code and iterate to what you feel is the correct solution”
— Holis A. Kinslow, 1968 NATO Software Engineering Conference
It doesn’t say to start with writing the code. It’s all about doing whatever gives you the best feedback-to-cost ratio at any given moment, keeping in mind that time is money.
Writing code is a challenging and time consuming task, just like creating a technical drawing. If you only need to agree on the general shape of a new car, you don’t fire up CAD software — you sketch until you like the shape, but you keep in mind that aerodynamics (i.e. reality) can disagree. And you probably need to get it in a wind tunnel quickly, before you invest too much time in that particular shape.
The whole point here isn't that mockups and flowcharts are pointless, it’s that we shouldn’t stop there when it comes to iterating on the design. We should treat code as something to iterate on as well, and thus something (in a way) disposable. Which also means we should write it in such a way that replacing bits and pieces is easy, so that one piece of the code base doesn’t lock other pieces into place.
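One way to keep pieces replaceable is to have code depend on a small interface rather than on a concrete implementation. What follows is only a minimal sketch of that idea; the names PriceSource, HardcodedPrices, DiscountedPrices and order_total are hypothetical, invented for illustration, and do not come from any real code base.

```python
from __future__ import annotations

from typing import Protocol


class PriceSource(Protocol):
    """Hypothetical interface: anything that can quote a price for a product."""

    def price_of(self, product_id: str) -> float: ...


class HardcodedPrices:
    """First, throwaway iteration: prices kept in a plain dict."""

    def __init__(self, prices: dict[str, float]) -> None:
        self._prices = prices

    def price_of(self, product_id: str) -> float:
        return self._prices[product_id]


class DiscountedPrices:
    """Later iteration: wraps any other source and applies a flat discount."""

    def __init__(self, inner: PriceSource, discount: float) -> None:
        self._inner = inner
        self._discount = discount

    def price_of(self, product_id: str) -> float:
        return self._inner.price_of(product_id) * (1 - self._discount)


def order_total(items: list[str], prices: PriceSource) -> float:
    """Depends only on the interface, so either source can be swapped in."""
    return sum(prices.price_of(item) for item in items)
```

Because order_total only knows about price_of, the first implementation can be thrown away and replaced without touching the code that uses it.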
Ops problem now
Let’s sum it up in two sentences:
programming is design
execution is production.
But where do you fit Ops in that? To find the right analogy, we have to move away from bridges and cars and into the land of hamburgers.
The 2016 movie “The Founder” tells the story of the beginnings of McDonald’s. There’s a scene in which Dick and Mac McDonald work on the burger assembly line, called the Speedee Service System, for their first fast food restaurant. Their goal? Order ready in 30 seconds, not 30 minutes.
They test multiple configurations in a rather ingenious way, by drawing the kitchen in chalk on the ground and having people pretend they’re assembling burgers there, until they converge on a design where people don’t get in each other’s way and burgers can be made in the prescribed time. Again, to hammer the point home, they used the approach with the best learning-to-effort ratio.
They had the hamburger design (by analogy, the code), and the workers with the right skills (by analogy, AWS or GCP), but to ensure a steady stream of production (by analogy, websites on screens), with no delays and the highest possible quality, they needed the right supply chain and assembly line. Dick and Mac are Ops.
Let me be very clear here: ops are not assembly line workers (or at least they’re not supposed to be). With infrastructure-as-code, containers, pipelines, and clouds, the worker positions are occupied by computers. Ops are the assembly line and supply chain designers whose goal is to make sure that the product blueprint (code) is turned into products (buttons and input fields) without friction, delays, or any other forms of waste. That’s where ideas like poka-yoke (ポカヨケ, meaning “mistake proofing”) come in, landing us squarely back in the Lean and Toyota Production System territory.
However, there’s a very important thing here to remember — the assembly line designers don’t work in a vacuum. The shape of structures — be it cars or burgers — challenges the assembly process and vice versa. That’s part of the reason why you see really fancy concept cars, which don’t care about mass production, turning into… less fancy production cars. Or why the first McDonald’s only sold one kind of burger with exactly two pickles.
There’s really no value in coming up with crazy ideas if you have no way of delivering on them at scale. On the other hand, with assembly lines evolving, you can now get one of a dozen different burgers at McDonald’s, or an Audi e-tron, looking almost like the concept car, equipped exactly to your liking.
So just like programmers should be included in the product design process, so should ops, the people designing mass production based on the ever changing design. And while including programmers is sometimes viewed as unorthodox, including ops is usually perceived as plain absurd. In many orgs, ops are treated as janitors: clean-up crew to aisle one, production database made a boo boo. And that’s where we reach the DevOps model.
DevOps is yet another terribly named idea which, to add insult to injury, lacks a precise definition. That leads people to think that DevOps is about developers learning ops and getting access to production servers, or about system administrators setting up Docker, Ansible and Jenkins. Which is all wrong. Contrary to what the name might imply, it’s not about mixing skill sets within individuals, it’s about inviting the Ops skill set into the design process. Whether you do it with jacks of all trades or with specialists is irrelevant.
Personally, I like to think of it as removing the barrier between Dev and Ops, over which stuff is thrown, with Ops normally being expected to just deal with it. DevOps is about mutual respect and collaboration. It makes all the more sense when you realize that the Dev part includes not just developers (again, naming), but also the product and business people. It’s all weirdly simple and obvious once you get over the name and embrace the idea of consulting all fields.
One more missing piece is quality assurance. Again, in the common misunderstanding QA is something that happens after the design and production, which is how software companies are often organized. How common this is is best evidenced by the sheer number of memes about programmers “finishing” work, handing it to testers and saying “it worked on my machine”. This results in all kinds of problems, from a lack of automated tests, through releases taking ages, all the way to tensions within teams.
Amusingly, manufacturing moved away from that decades ago. Again, building quality assurance into every single step of the design and manufacturing process is the cornerstone of the Toyota Production System and Lean. It was also the reason why Japan ran circles around American car manufacturers in the mid-20th century. As a matter of fact, programming best practices also left that idea in the dust decades ago, but these steps and silos are so alluring we keep going back… QA is not something testers do when work is “dev complete”. It should be built into the entire process and happen continuously, not as a separate stage.
We will explore how in the future, but that’s also the point of the cycle of continuous improvement. Your process should allow you to spot defects and imperfections as soon as they happen, instead of just making sure your stuff acts “according to requirements” (which may very well be irrelevant) or that you push stuff out quickly enough.
Moreover, if you find a defect, you should study its source and act by altering the process (e.g. shortening the feedback loop), so similar defects either don’t happen or are caught automatically in the future. Otherwise you’ll end up annoyed for 20 years because the spine on your supercar is 5 centimeters too wide. Or you sink your company Knight Capital style (google that name and prepare to be amazed). Or end up with a parking lot full of cars with panel gaps the size of the Grand Canyon. And I’m not talking about Tesla, I’m talking about 60s, 70s, 80s General Motors.
There is another profound lesson here about the importance of understanding what you’re doing and why, and how packing your process full of good practices means nothing, unless psychological and structural changes follow.
A foundational part of the Toyota Production System's QA was something called an Andon (アンドン) — a cord or button used to signal a problem, like an ill fitted part, which causes the whole line to stop. Whenever it's used, engineers gather to study and act on the issue. Only once the issue is resolved (temporarily or permanently) does the line start moving again. Toyota’s been doing that kind of quality assurance since the 50s, deriving it directly from Deming.
What was GM’s idea of quality assurance back then? Well, they handed off a “dev complete” car to QA at the end of the assembly line. If the door didn’t close nicely, the QA people hit it with a hammer until it either closed or fell off. In the former case the car went to the dealership, in the latter to the parking lot, forming an inventory to be dealt with later. The line never stopped, though, which means later never came.
Obviously, when GM found out what Toyota had been doing, they copied it. I mean, they copied the Andon itself, as a button, but not the philosophy behind it. The plant managers were still reviewed based on the number of cars going off the line. Hence, you used the Andon, you stopped the line, you hurt your manager’s promotion, and you got fired. It took a while, and direct cooperation with Toyota in the NUMMI plant, for the idea of quality over output to gain traction in the US.
On that note, it may seem that quality and output are mutually exclusive, but they really aren’t. Let’s make a simplistic calculation. If 1970s GM got a car off an assembly line 25% faster than Toyota on average, and you were just interested in the number of cars leaving the line, GM would’ve been king. But if 100% of Toyotas reached the dealerships and 50% of Chevys and Buicks ended up in the parking lot to be fixed later, Toyota would actually be selling more cars at a steadier pace. And cheaper too, because their price doesn’t include rebuilding faulty cars.
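The simplistic calculation above can be written down in a few lines. This sketch assumes that "25% faster" means 25% more cars leaving the line per period, and the baseline of 100 cars is an arbitrary number chosen for illustration; none of these figures are historical data.

```python
def cars_sold_per_period(line_output: float, first_pass_yield: float) -> float:
    """Cars that leave the line AND reach a dealership within one period.

    line_output      -- cars coming off the line per period (illustrative)
    first_pass_yield -- fraction of those cars good enough to ship right away
    """
    return line_output * first_pass_yield


# Toyota: baseline output, every car reaches the dealership.
toyota = cars_sold_per_period(line_output=100, first_pass_yield=1.0)

# GM: 25% more cars off the line, but half of them end up in the parking lot.
gm = cars_sold_per_period(line_output=125, first_pass_yield=0.5)

print(f"Toyota sells {toyota} cars per period, GM sells {gm}")
```

With these made-up inputs Toyota ships 100 cars per period and GM 62.5, and that is before counting the cost of reworking the parked ones. Reading "25% faster" as 25% less time per car (roughly 133 cars per period) leads to the same conclusion.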
The question here is how can you achieve this in a setup where product owners have their team, programmers are divided between backend and frontend, QA are yet another team and Ops is hidden in the basement? The answer is simple: you can’t.
You need all these people working together, forming a single team. That’s the only way you can be sure that if someone comes up with a requirement or a solution which is not feasible, someone else is going to bring that up immediately, limiting time wasted on pursuing it or finding workarounds later on. More importantly, iterating on designs quickly requires close cooperation and swift communication between these experts, which is impossible when they each have their own backlog.
Obviously, that brings up the question of how can you be productive when everyone’s working on one thing? If you have ten things to do and ten people, isn’t it better for them to work in parallel? You'd think so, but that really only works for tasks which are either very simple or don't require creativity, which forms a relatively small subset of tasks going into designing software. Unless it’s so dull, obvious, and commoditised it’s not worth working on in the first place.
If you want to assemble ten IKEA Kallax 4x4 storage cubes then by all means get ten people to do it. They’ll build all ten in the time it takes to build one. That work scales horizontally very well. However, if you want to design a new IKEA wardrobe, that doesn’t scale horizontally at all. And, as I said many times over, programming is design work, not assembly work. Many of our problems stem from not understanding that.
Even the organization charts and the way people sit are evidence of that. In most organizations I’ve worked with, teams went like this: product, backend, frontend, QA, ops. Each with their own superiors, priorities and physical spaces. Sometimes even their own backlogs. This makes absolutely zero sense, as none of these teams could’ve delivered any value from start to finish on its own. The people in those teams shared a skill set, instead of sharing a goal.
A team is defined by the bounded context it works within and the skills needed to deliver value within that bounded context. That should be obvious, but it evidently isn’t.
The term “bounded context” comes from Domain Driven Design. It means an enclosed subset of a business domain. For instance, a car dealership can have bounded contexts such as servicing and sales. Each of these will concern themselves with cars, obviously, but the understanding (or abstraction) of a car in each one will be different:
In the context of servicing a car has screws but no price. In the sales context, it has a price but no screws. Both contexts contain the idea of a car, but it’s not interchangeable — an abstraction useful in one context will be useless in the other.
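In code, this could look like two separate models of a car, one per context. This is only a rough sketch: the class names, the fields, and the idea of using the VIN as the one identifier both contexts share are assumptions made for illustration, not something prescribed by DDD or by this post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Servicing context: a car is a collection of parts and has no price.
@dataclass
class ServicedCar:
    vin: str  # assumed shared identifier between contexts
    worn_parts: list[str] = field(default_factory=list)

    def needs_service(self) -> bool:
        return bool(self.worn_parts)


# Sales context: a car is something with a price and has no screws.
@dataclass
class CarForSale:
    vin: str  # same identifier, different abstraction
    list_price: float

    def price_with_discount(self, discount: float) -> float:
        return self.list_price * (1 - discount)


in_workshop = ServicedCar(vin="VIN123", worn_parts=["brake pads"])
in_showroom = CarForSale(vin="VIN123", list_price=20000.0)
```

Both objects describe the same physical car, yet a change to how parts are tracked cannot break pricing, because the sales model simply has no parts to break.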
That’s incredibly important for software architecture. DDD is usually thought of as the most powerful tool for software architecture design. In fact, I already mentioned that the domain specific language (aka, ubiquitous language) is invaluable in making sure your software does what it’s supposed to be doing both now and in the future. But… DDD is also incredibly important in shaping and building teams, which is something that, in my opinion, is not talked about nearly enough.
You see, the main point of bounded contexts is to make sure the software fits in everyone’s heads, from domain experts, through UX designers, all the way to programmers. Additionally, it helps ensure that one piece of a system doesn’t act as noise, or potential threat, to other pieces. Going back to the original example, let’s say you’re making changes to a Sales sub-system which shares a Car class with a Servicing sub-system. The Car class (which may also be an active record…) will have a lot of relations to bolts and brake pads, which you’ll be forced to ignore (noise) but also try not to break (threat). If that sounds like a contradiction, that’s because it is. Working in bounded contexts eliminates that problem.
It’s possible that you now have a thought in your head, which goes something like this: “I know where this is going — microservices”. So let’s get that out of the way. DDD is essential for microservices, but it doesn't require microservices. It’s important to remember that microservices form a distributed system, which makes them complex if you know what you’re doing, and complicated if you don’t.
Equally important is that you don’t need cables to create boundaries between contexts, and if you’re not skilled and careful enough, you may end up having cables connecting two halves of a single bounded context. Which is when you go from complex to complicated and lose all the benefits of microservices. In other words, do DDD within a monolithic system, using the encapsulation tools provided by your language of choice, and use microservices only when you have a good reason to do that, for instance if each bounded context is big enough that it warrants having a separate, big team working on it. This whole thing is much more complex than the "microservices good, monoliths bad" narrative I encounter all too often.
And that’s how we circle back to the idea of a team and the human aspect of software architecture. Tools like DDD and microservices are meant to be used to make software development easier and make software itself easy to change.
If your domain is small and simple, you will still benefit from having a ubiquitous language, but the tools designed to cut large domains into bounded contexts, let alone microservices, will just overcomplicate a simple problem. On the other hand, if you have a number of teams, each working on a large, bounded chunk of a really massive domain, it may be a good idea to slice it into smaller applications (microservices) so they can achieve greater autonomy.
But do that only if the team feels comfortable working with a distributed system and if the organization's structure can support this, because it will add overhead. Otherwise, just introduce layers of indirection within the monolithic app. The design must fit the problem space, the traffic it must handle, but also the people who will be working on it. What it doesn’t have to fit is fashion.