Eric Passmore

Software Architecture Matters

Sun, 15 Apr 2018 18:12:21 -0500

There is evidence that good software architectures and high-performing teams go together. Do good architectures create good teams, or do good teams create good architectures? I am not sure which is the cause and which is the effect. It did get me thinking. What would I write down for architecture guidance? Below is my list. Expect future posts with example scenarios.

Keep it Simple
Easy to Modify
Built to Last
Speak the Same Language
Strive to Shift Left
Limit Vendor Lock-in

Keep it simple

Simple to Explain Write down the must-have features, required scale/capacity, and implicit security, privacy, latency, and availability requirements. Try to leverage a lightweight documentation approach that involves different roles and teams. Why? When the team that writes the code and builds the software collaborates on requirements, it leads to faster execution and better solutions. In addition, it is a lot more fun.
Create Clarity Use analysis to minimize the number of must haves, and clarify what is a must have vs should have vs nice to have. Why? Inventing and simplifying is one of the best parts about building software. Simplifying enables better, faster, and cheaper.
Predictable and Repeatable Test to ensure behaviors remain consistent. Zero tolerance for flaky tests. Yes, judgement is required to understand what constitutes a behavior. Why? Speaking personally, I enjoy spending time thinking about how my service or feature will be used. Test driven development has made me a better coder by helping me understand the contracts I have with customers.
Assume a Trusted Core Tightly coupled code written by a small team should be trusted and leveraged when you are a member of that team. Why? It is way more fun. Write code and move at the speed of trust!
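To make the Predictable and Repeatable point concrete: one common source of flaky tests is a hidden dependence on the wall clock. Here is a minimal sketch in Python, assuming a hypothetical `is_expired` function (my own illustration, not taken from any real service), where the clock is passed in so the test gives the same answer on every run.

```python
from datetime import datetime, timedelta, timezone

def is_expired(created_at, ttl, now=None):
    """Return True when an item created at `created_at` has outlived `ttl`.

    `now` is injectable so tests never depend on the real clock.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at >= ttl

def test_is_expired_is_deterministic():
    created = datetime(2018, 4, 15, 12, 0, tzinfo=timezone.utc)
    ttl = timedelta(hours=1)
    # A fixed "now" pins the behavior: not expired just before the TTL,
    # expired exactly at the TTL.
    assert not is_expired(created, ttl, now=created + timedelta(minutes=59))
    assert is_expired(created, ttl, now=created + timedelta(hours=1))

test_is_expired_is_deterministic()
```

The test doubles as a written contract: expiry happens at the TTL boundary, not after it.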

Easy to modify

Enable Personal Versions Enable personal code branches, services that can run on a local PC, access to native app emulators, and test requests. Why? Speaking personally I want a safe place to learn.
Loosely Coupled Decompose hard problems into a set of self service APIs. Why? Open APIs enable teams to work independently and move fast. It feels amazing to make small targeted changes with just a few hours of ramp up time.
Automate build, image, verify, release Continuous Integration and testing creates fast feedback loops and built-in gates eliminate fear of screwing things up. Why? Empowers engineers to make changes and improvements.
Keep an Inventory Build a list of services, applications, and useful libraries with details on who owns each one, how to find the source, and how to build and release it. Why? It feels good to self-navigate to the system or service that needs improvement. From a compliance perspective, having a list of things makes the assessment phase easier. For example, think about adding metadata like provenance for your Open Source, and the most secure version of the software you should leverage.
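An inventory does not need special tooling to get started. As a rough sketch, it can be plain data that anyone can query. The field names, team names, URLs, and build commands below are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class InventoryEntry:
    name: str
    owner: str
    source_url: str
    build_command: str
    # Open source package name -> where it came from (provenance).
    open_source_provenance: dict = field(default_factory=dict)

def owned_by(inventory, owner):
    """Names of everything a given team owns."""
    return [entry.name for entry in inventory if entry.owner == owner]

# Hypothetical entries; example.com URLs are placeholders.
inventory = [
    InventoryEntry("checkout-api", "payments-team",
                   "https://git.example.com/payments/checkout-api",
                   "make release"),
    InventoryEntry("search-ui", "search-team",
                   "https://git.example.com/search/search-ui",
                   "npm run build"),
]
```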

Build it to last

Explicit scale/capacity Know what scale/capacity is needed and write down a plan to get there. Socialize the plan. Why? Brownouts/Blackouts make customers unhappy.
Forward/Backward Compatibility APIs need forwards and backwards compatibility. Why? Without support for forward/backward compatibility, changes require synchronized deployment between provider and consumer. In a large scale system, it can take hours or days to roll out a change, and intentionally failing during software updates is not a level of quality any org aspires to meet. Another reason is complexity. Lack of forward/backward compatibility creates complexity, as changes deep in the call stack will cause chaos because all upstream layers will need to deeply understand the change and compensate.
Build in coarse-grain mitigation and validate with coarse-grain failure injection. Let hosts (or a DC) fail and route requests to healthy hosts (or another DC). Journal requests and follow up until success or exhaustion. Enable cache entries to go stale when providers go down. Create partial responses or reasonable static responses when the full answer isn’t possible. Enable fast rollback to last known good (applications, config, and data). Note: Rollback often requires backwards compatibility. Why do it? Software engineering is complicated and sometimes we get it wrong. When we don't have mitigation we feel like the weight of the world is on our shoulders as we generate custom hot-fixes. So Sad :(
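Two of the mitigations above, letting cache entries go stale when a provider is down and falling back to a reasonable static response, fit in a few lines. This is only a sketch under my own assumptions: `StaleOkCache` and its parameters are hypothetical names, and a real service would also want logging and limits on how stale is too stale.

```python
import time

class StaleOkCache:
    """Cache that serves stale entries when the provider is down."""

    def __init__(self, fetch, ttl_seconds, fallback, clock=time.monotonic):
        self._fetch = fetch          # callable(key) -> value; may raise
        self._ttl = ttl_seconds
        self._fallback = fallback    # static response of last resort
        self._clock = clock          # injectable so tests control time
        self._entries = {}           # key -> (value, stored_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[1] < self._ttl:
            return entry[0]          # fresh hit, no provider call
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]      # provider down: serve the stale value
            return self._fallback    # nothing cached: static response
        self._entries[key] = (value, self._clock())
        return value
```

Failure injection for this piece is just a `fetch` that raises, which makes the mitigation cheap to validate in CI.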

Speak the same language

Use HTTP and JSON Use HTTP and JSON docs often and leverage the RFCs for response codes and tricky situations. Why? It is simple. It makes it way easier to understand what is going on, and engineering teams tend to be more empathetic to each other when using a common protocol.
Leverage A Few Good, Widely Used Languages and Packages When starting a new project, look around your organization to leverage widely used languages and packages. Why? Over time the different systems grow. Having 20 different languages and 300 different software stacks kills engineering fungibility. Ramp up time goes way up. In the end variance of software is a tax on every engineer in the organization. Increasing the variance of software increases the tax.
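A side benefit of sharing JSON everywhere is that consumers can read it tolerantly, which is one inexpensive way to get forward and backward compatibility. The sketch below assumes a hypothetical user document with `id`, `name`, and a later-added `locale` field. None of these names come from a real API.

```python
import json

def parse_user(payload):
    """Parse a user document, tolerating older and newer producers."""
    doc = json.loads(payload)
    return {
        "id": doc["id"],                       # present in every version
        "name": doc.get("name", ""),
        # Added in a later version: default when an older producer omits it.
        "locale": doc.get("locale", "en-US"),
        # Fields this consumer does not know about are simply ignored,
        # so a newer producer can add them without breaking us.
    }
```

With this shape, provider and consumer can deploy on their own schedules as long as new fields are optional and old fields are not repurposed.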

Strive to Shift left

Include Security/Privacy/Accessibility In Code Reviews Why? Engineers have context on code they just wrote, and it is much easier to solve problems when they are little.
Include Code Scan Tools in CI Pipeline Add Black Duck and tools like it to scan for security vulnerabilities at build time. Why? Engineers have context on code they just wrote, and it is much easier to solve problems when they are little.
Validate a Key GDPR Feature, the Right to Be Forgotten, in CI Pipeline Create negative test cases that attempt to track users who do not want to be tracked. Create negative test cases that attempt to access history after a request to delete history. Why? This is an important challenge to address, and it will take a lot of engineering time to get right. Shifting left is needed to make this more effective and efficient.
Add Audit Controls as part of CI Pipeline Every business has some key audit controls. Best to work with audit and build in the historical records, reconciliations, and access roles early. Why? This is an important challenge to address, and it will take a lot of engineering time to get right. Shifting left is needed to make this more effective and efficient.
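The two negative test cases for deleting and not tracking users might look something like this. It is a toy sketch: `HistoryStore` is a hypothetical in-memory stand-in for whatever service really holds user history, and a real pipeline would run the same assertions against a deployed test environment.

```python
class HistoryStore:
    """Toy in-memory store standing in for a real history service."""

    def __init__(self):
        self._events = {}        # user_id -> list of events
        self._opted_out = set()  # users who do not want to be tracked

    def opt_out(self, user_id):
        self._opted_out.add(user_id)

    def track(self, user_id, event):
        if user_id in self._opted_out:
            return False         # refuse to record anything
        self._events.setdefault(user_id, []).append(event)
        return True

    def delete_history(self, user_id):
        self._events.pop(user_id, None)

    def history(self, user_id):
        return list(self._events.get(user_id, []))

def test_opted_out_user_is_not_tracked():
    store = HistoryStore()
    store.opt_out("u1")
    # Negative case: try to track anyway and expect nothing to stick.
    assert store.track("u1", "page_view") is False
    assert store.history("u1") == []

def test_history_is_gone_after_delete():
    store = HistoryStore()
    store.track("u2", "page_view")
    store.delete_history("u2")
    # Negative case: try to read history after the delete request.
    assert store.history("u2") == []

test_opted_out_user_is_not_tracked()
test_history_is_gone_after_delete()
```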

Limit Vendor Lock-in

Choose Technologies with Adoption Across Vendors Specifically, I am thinking of using technologies that work across Amazon AWS, Microsoft Azure, and Google Cloud. There is some cool stuff out there that, with a little work, enables compute workloads to run in multiple cloud ecosystems (run on both AWS and Azure). A caveat: storage and machine learning across cloud providers are harder to pull off, and it may not be worth building vendor-agnostic APIs across all capabilities. Why? Selecting cloud providers is a big choice, and leveraging a single vendor will have negative business impact. Past examples of business impact include long outages or temporarily running out of capacity in a specific region. In addition, diversity of providers enables organizations to better manage cost and capacity.
Shared Code Make key application code portable across iOS, Android, and Web applications. Why? Think of how amazing it would be to write code once and have it run everywhere. It would be very cool if common code could power accessibility features through adaptive rendering.

9 Ways to Increase Productivity

Tue, 20 Sep 2016 13:38:54 -0500

Call me a cynic, but I have never seen Work In Progress (WIP) limits taken seriously at big companies. So here are 9 ways of increasing productivity that do not reference WIP. TL;DR: Create clarity and focus by eliminating everything that is a distraction to front-line individual contributors.

9 Ways

Apply a surge of resources to fix top live site issues
Treat critical priority bugs as a big deal and make sure they get fixed the first time
Require tests to verify tasks are documented at design time and implemented before production time
Automate deployments and ensure the right environment configuration
Create emergency lane for developer work to enable tasks to jump to the front of the queue
Continuous integration of builds along with basic correctness tests
Create targeted goals for responsiveness and peak requests per second
Automate provisioning to grow and shrink capacity
Enable feature flags to eliminate branching and merging (branching is ok; merging sucks)

Surge on Live Site

Nothing worse than live site fires to suck the life out of a team. Fires happen at random hours and keep people up all night. Fires are unplanned, and eat into planned work. Fires are huge distractions that make it hard to do excellent work.
Why it is genius
When you use the word Surge it sounds cool. People immediately assume the effort will have an impact. Really it's all in one word.
Why it sucks
Should have never needed a Surge. Oh well, no sense in revisiting decisions of pointy haired bosses.

Fix Critical Bugs The First Time

Mostly people assume software has two levels of quality. It either works or it does not work. Sadly, most code works and it looks awful. It's like an abandoned factory. All the windows in our abandoned factory have holes, and it is dirty and dingy. In this environment, developers tend to get in and get out as quickly as possible. As a result they don't take the time to truly fix critical bugs. They make minimal fixes and test against one or two key scenarios. This is also known as legacy code.
Why it is genius
Quality is a self perpetuating machine. Once a high bar is set the code starts looking better. Once the code starts looking better other developers want to keep things neat and tidy. It is like a factory with shiny new windows; no one wants to throw the first rock.
Why it sucks
Now you are going to have to spend 2 weeks fixing that bug, and that leaves no time for the cool kubs prototype project. It sucks to be the first on cleanup duty.

Test Before Production (aka Clear Requirements)

It seems obvious that code should have some tests. The truth is the requirements were very vague and wishy-washy, so we pretend we know what to code up. Then the tests come along and we discover we had no idea how it was supposed to work. Instead of manning up and fixing the problem, we politely explain there is no bug, and it works as designed.
Why it is genius
Tests are the best requirements. If the people that wrote requirements had to write tests they would quit and run away in horror. So skip the requirements and just write the tests. Bonus points for negative tests, meaning tests outside the happy path.
Why it sucks
You may need to talk to customers to figure out how this feature was supposed to work so you can write the tests. In the process of talking to customers you may discover you are building the wrong thing.

Automate Deployments

We all hate waiting. Why wait for deployments?
Why it is genius
Honestly, the reason to automate deployments is to eliminate human mistakes. All of those manual configurations and manual release steps cause lots of rollbacks and outages. Automation ensures we do things with precision and fewer manual errors.
Why it sucks
Someone needs to tell the release person who manually configures and deploys the software that he or she needs to move up the value chain. That is the polite way of saying a machine has taken their job.

Emergency Lane

In every project there is the catch-22 moment when the team realizes they need the mock data before they can code up the server side code. Hey, missed dependencies happen; we call them emergencies. So put some emergency work in the queue to create mock data. Problem solved.
Why it is genius
Emergency Lanes make escalations ok. It is the single best thing your team can do to shift from controlling change to embracing change. Without the emergency lane you are forced to sneak away from "real work" to work on critical fixes. The effort you put in is never tracked and as a result you and the rest of the team end up oversubscribed.
Why it sucks
People outside the team abuse the emergency lane to get their tasks completed. As an example, the product manager wants a shiny new feature and puts it in the emergency lane. Bottom line, only committed team members should be managing the tasks. Folks outside the committed team need to respect that.

Continuous Integration

Take your code and merge it, build it, test it, test it like production, and at the end you get a thumbs up or thumbs down. NOTE: thumbs up is good, that means it worked
Why it is genius
Finally feedback right away. You can now write awesome code and get things into production. As a developer you will be happy that your awesome code is now in production. Before Continuous Integration things failed for no apparent reason. Honestly before Continuous Integration it was almost like the system was cursed. Burning incense and wearing your lucky t-shirt was the only way to get code into production.
Why it sucks
With continuous integration you no longer have the ability to shrug and say works on my machine.

Targets for Perf and RPS

Get a napkin and write down how many requests per second (RPS) your system can handle on a good day. Right below that, write down how long those requests should take to complete. Congratulations, with the help of a napkin you now have a Service Level Objective (SLO).
Why it is genius
Since you know what your service can do at peak you can tell everyone else. When that brand new mobile app goes online and slams your service with an additional 10,000 RPS you can politely point to the napkin taped to the wall. PS expect the napkin to be ripped to shreds
Why it sucks
I believe I am rich and thin. My kids are geniuses. I believe my service is fast and powerful. Once measured your service will look slow and weak. SLAs crush egos. Tip: don't weigh yourself or give your kids an IQ test
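If you are brave enough to measure, checking the napkin against reality is a few lines of arithmetic. This sketch assumes you already have a list of request latencies in milliseconds and an observed RPS from a load test. The function names and the choice of the 99th percentile are my own illustration, not a standard.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_slo(latencies_ms, observed_rps, target_p99_ms, target_rps):
    """True when the service sustained the target RPS within the latency target."""
    fast_enough = percentile(latencies_ms, 99) <= target_p99_ms
    return observed_rps >= target_rps and fast_enough
```

Using a percentile rather than an average keeps a handful of very slow requests from hiding behind many fast ones.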

Automated Growing and Shrinking Capacity

Software is amazing. Software can do almost anything including grow and shrink. Imagine if you could grow a foot taller for a basketball game! Imagine if you could slim down for the cross-country flight in economy class!
Why it is genius
No need to write super efficient code and struggle with complex caching logic. Just add more compute power and grow your way out of it. When traffic dies down get rid of the excess flab to save $$. A win-win.
Why it sucks
Growing is great as long as the load balancer and database can keep up. Often they cannot keep up. Growth is limited by other factors in the environment. The second problem is the reactive nature of growing hosts. The signal for a cluster to grow in capacity only happens after the cluster runs out of capacity. Adding new hosts to a cluster takes time. A late signal and a lag often result in capacity arriving too late.
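One common way to soften the late-signal problem is to size the cluster for a target utilization well below 100%, so the grow signal fires while there is still headroom. The sketch below is a simplified version of that idea. The default numbers are placeholders, and `max_hosts` stands in for whatever the load balancer and database can actually tolerate.

```python
import math

def desired_hosts(current_rps, per_host_rps, target_utilization=0.6,
                  min_hosts=2, max_hosts=50):
    """Host count that keeps each host at or below the target utilization."""
    # Plan as if each host could only take a fraction of its real capacity,
    # which leaves spare room while new hosts are still booting.
    needed = math.ceil(current_rps / (per_host_rps * target_utilization))
    # Never shrink below a safe floor or grow past what dependencies can handle.
    return max(min_hosts, min(needed, max_hosts))
```

It does not remove the lag of adding hosts. It only buys time by starting the growth earlier.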

Feature Flags

Your new middle-out algorithm for compression is genius. Making the update to the new algorithm is a big change and a big risk. If only there were some way to have both old and new algorithms at the same time. Then switch to the new algorithm when you wanted to test it.
Why it is genius
Feature flags let you push out crazy good stuff while protecting users from the associated risk of change. By adding a flag to the requests you can light up new code and test it with a small set of beta users.
Why it sucks
Putting your code behind flags is addicting. Once you experience the rush and exhilaration of feature flags you will want to do more and more. It will all come to an end in a bad way. Your feature flags will need feature flags and the code will become an unreadable mess of nested switch statements. Consider yourself warned.
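For the record, a single well-behaved flag is small. Here is a sketch of keeping the old and new compression paths side by side and lighting up the new one for beta users or a percentage of traffic. The flag name, the bucketing scheme, and the use of two `zlib` levels as stand-ins for the old and new algorithms are all assumptions for illustration.

```python
import hashlib
import zlib

def is_enabled(flag_name, user_id, rollout_percent, beta_users=frozenset()):
    """Decide whether `user_id` sees the code behind `flag_name`."""
    if user_id in beta_users:
        return True                       # beta users always get the new path
    # Hash flag + user so each user lands in the same bucket on every request.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent

def old_compress(data):
    return zlib.compress(data, 1)         # stand-in for the existing algorithm

def new_compress(data):
    return zlib.compress(data, 9)         # stand-in for the risky new algorithm

def compress(data, user_id, rollout_percent=0, beta_users=frozenset()):
    """Both algorithms ship together; the flag picks one per request."""
    if is_enabled("new-compression", user_id, rollout_percent, beta_users):
        return new_compress(data)
    return old_compress(data)
```

Setting `rollout_percent` back to 0 is the rollback, with no deploy needed. Deleting the flag once the new path wins is what keeps the nested switch statements away.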

Start Doing Checklists

Mon, 11 Jul 2016 12:01:29 -0500

Start doing checklists! Stop doing failure mode analysis. Failure mode analysis does not work.

Checklists work.
Checklists scale to hundreds of engineers.
Checklists have a long history in fields like aviation and medicine.

My new book tells the story of migrating a 450 million user website to Azure, and shares the true story of how we almost failed to launch. I share the 76 point checklist that saved the launch, and that can help any team build a reliable service using cloud technology.

Failure Mode and Effects Analysis

Failure Mode and Effects Analysis is a process of looking across the entire system at all the components, brainstorming all the possible failures, and then scoring the failures to separate the most impactful events from the minor annoyances. Sounds like a great process. There are three reasons it does not work.

Teams are proud. They do not want to analyze their services for failure.
Teams are busy. Failure mode analysis seems like extra homework.
Teams do not trust management, and are afraid of being evaluated.

Said another way, Failure Mode Analysis forces teams to think in very broad terms, asks them to re-evaluate their software, and asks them to guess the most probable, critical failures. Changing the way you think is hard work, and it is a change that will not happen overnight.

Checklists

Checklists are a better way to create reliable systems. Why do checklists work?

Teams accept simply stated good ideas.
Teams can react, accept or reject, specific guidance
It takes a small amount of effort to modify and improve an explicit list
Checklists make quality standards explicit

For example, a checklist item might state that Deployment of services should not impact availability or up-time. This one checklist item will help teams think through their software deployment process and the possible problems with deploying software. Once this checklist item is accepted teams take it as an explicit quality standard.

Said another way, checklists are specific, which makes them easy to approach and makes self-evaluation easy. Easy self-evaluation enables teams to learn quickly.

Checklist For Reliable Cloud Services

A good battle-tested checklist is worth its weight in gold. The last three chapters of this book act as a reference guide and include a 93 point checklist used to move a global service to the public cloud (76 -> 93 !? we added a few items). The checklist covers four main areas

Pre-release: things to check before every release
Deployment: capabilities needed as part of every release
Monitoring & Alerting: what telemetry to collect, measure, and alert on
Mitigation: how to make an unstable service stable
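Because each item is a short, explicit statement, a checklist is easy to keep as plain data that a team can accept, reject, and score itself against. The sketch below uses the four areas above. The class names, fields, and scoring rule are my own assumptions, not the format used in the book.

```python
from dataclasses import dataclass
from enum import Enum

class Area(Enum):
    PRE_RELEASE = "Pre-release"
    DEPLOYMENT = "Deployment"
    MONITORING = "Monitoring & Alerting"
    MITIGATION = "Mitigation"

@dataclass
class ChecklistItem:
    area: Area
    statement: str
    accepted: bool = False   # has the team accepted this as a standard?
    done: bool = False

def progress(items):
    """Fraction of accepted items that are done; rejected items are not counted."""
    accepted = [item for item in items if item.accepted]
    if not accepted:
        return 0.0
    return sum(item.done for item in accepted) / len(accepted)
```

Scoring only the accepted items keeps the number honest about standards the team has actually signed up for.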

The book provides the rationale for each checklist item, and shows the hidden tips and tricks to implement the checklist items. Any team can take these checklist items to master large scale, cloud-native services.

Rolling out Checklists to Hundreds of Engineers

Many people ask, how can we scale DevOps? It turns out checklists work pretty well for scaling practices across the organization. When checklist items are done well they stand on their own and do not require lengthy explanations or supporting documentation. This enables checklists to be shared through the entire organization.

It's best to roll out checklists in successive waves.

First, execute on the most important 5-10 items
Second, kick off centralized efforts for the shared, core 5-10 items
Third, teams schedule fault injections to test checklist adoption
Fourth, learn and improve as a result of fault injection tests
Fifth, implement 90% of the remaining checklist items

Progress not Perfection

There is one catch. Top-down mandates of checklist items will not work. Leaders need to express a tone of progress not perfection when rolling out the checklist. Checklists are a tool to be leveraged for improvement. As teams mature they adopt additional items and make changes to suit the needs of their business.