Preamble #
What does effective execution look like? Follow this checklist to make sure your project is set up for success.
Checklist #
Plan Early and Look Ahead #
Enumerate milestones and set dates on them.
- What are all the steps and milestones you need to achieve the objectives?
- What are conservative and optimistic estimates for each milestone?
- What resources do you need for each milestone, and when are they available?
This recommendation is the first in this checklist because it is the most important. The rest of this page does not matter if you do not start with a written plan.
Example #
You are leading a project to migrate frontend assets from Django to a CDN.
- M1 (3 days): Choice of CDN. Identify all the stakeholders for this decision, estimate costs, and get buy-in on the decision criteria. Run a proof of concept with the chosen CDN to learn about all the steps you need to implement it.
- M2 (10 days): CDN deployment. Design and implement a CI/CD pipeline that uploads the frontend assets to the CDN. Account for all workflows, including local, staging, canary, and production.
- M3 (4 days): Phased roll-out. Set up the observability needed to monitor the roll-out, and write down how you will communicate DevX changes to the rest of Engineering.
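Putting dates on the milestones above can be as simple as laying the estimates end to end. The following is a minimal sketch, not a prescribed tool: it assumes the day counts are working days (Monday to Friday), ignores holidays, and uses a hypothetical start date. The `buffer` parameter is also an assumption, standing in for the gap between an optimistic and a conservative estimate.

```python
import math
from datetime import date, timedelta

# Milestones from the example above: (name, estimated days).
MILESTONES = [
    ("M1: Choice of CDN", 3),
    ("M2: CDN deployment", 10),
    ("M3: Phased roll-out", 4),
]


def add_working_days(start: date, days: int) -> date:
    """Return the date that is `days` working days (Mon-Fri) after `start`."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def schedule(start: date, milestones, buffer: float = 1.0):
    """Lay milestones end to end and return (name, due date) pairs.

    `buffer` scales every estimate, e.g. 1.5 for a conservative plan.
    """
    plan = []
    cursor = start
    for name, days in milestones:
        cursor = add_working_days(cursor, math.ceil(days * buffer))
        plan.append((name, cursor))
    return plan


# Hypothetical start date (a Monday), chosen only for illustration.
for name, due in schedule(date(2024, 1, 1), MILESTONES):
    print(f"{due.isoformat()}  {name}")
```

Running it twice, once with `buffer=1.0` and once with a larger value, gives the optimistic and conservative dates for each milestone side by side.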
Identify Risks #
Use your plan to enumerate the risks, then address them. Over time, build and refine a checklist within your organization that you can use consistently to brainstorm and highlight risks.
Examples #
- What new technologies or new patterns are you introducing? Are there any existing examples? What can you do to reduce this risk?
- Do the project participants have relevant experience or expertise? If not, how can you prepare them?
- What tests do you need to acquire confidence and ascertain correctness before deployment? Do you have the tools, environments, context, and data needed to implement and execute these tests?
- Has someone at your company done this before? What pitfalls can they help you identify?
Identify Dependencies #
It is easy to be paralyzed by unknown unknowns. Use the SRE method: how much effort would it take to uncover (and address) 90% of the unknowns? 99%? 99.9%?
Example #
- What are all the relevant user stories, use cases, workflows, and API calls? (List them.)
- Do you understand all of them? If not, do you know who to ask, and have you engaged them?
Identify Stalling Periods #
Medium to large projects often require additional approvals, decisions, reviews, or roll-outs. Each step involves a waiting period, and some organizations are better at setting expectations than others.
Every time the project needs to wait, you can
- Use these periods for other projects or maintenance work;
- Start the clock early; and
- Make progress on other milestones while waiting.
Examples #
You are leading a project to migrate frontend assets from Django to a CDN.
- Procurement: procuring a new vendor requires going through Compliance and Legal. This can take a week.
- Decision: you should shop the proposal around the organization to solicit feedback, in case other teams have plans for using this CDN and custom needs/requirements.
- Phased roll-out: for a change this big, you would want to test it carefully and be able to notice bugs quickly (e.g. missing webpack chunks, stale caches, CORS issues).
- Customer Communications: Large Enterprise Customer X requires that you announce all user-facing changes one month ahead of time. Does this qualify?
- Internal Communications: you need all engineers to upgrade Node.js to v20. Let’s give them one week’s notice.
Iterate Quickly to Test New Ideas #
Identify hypotheses that need testing, and manage risk by planning small experiments or building prototypes to collect evidence for or against them.
Example #
You are leading a project to replace AWS CloudFront with Fastly as part of an org-wide performance optimization effort.
- Build a proof of concept to compare Fastly and CloudFront.
- Upload your TLS certificates and review a Fastly configuration against a CloudFront configuration to look for potential mismatches.
- Design other tests that would let you fail fast.
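Reviewing one configuration against another, as in the second bullet, can start as a plain key-by-key diff. The sketch below assumes you have already translated both configurations into a shared, simplified vocabulary. The setting names and values are hypothetical and do not reflect the actual CloudFront or Fastly schemas.

```python
def diff_configs(current: dict, candidate: dict) -> dict:
    """Return {setting: (current value, candidate value)} for every mismatch."""
    mismatches = {}
    for key in sorted(set(current) | set(candidate)):
        a = current.get(key, "<unset>")
        b = candidate.get(key, "<unset>")
        if a != b:
            mismatches[key] = (a, b)
    return mismatches


# Hypothetical, normalized settings for illustration only.
cloudfront = {
    "min_tls_version": "1.2",
    "default_ttl_seconds": 86400,
    "compression": True,
    "cors_allowed_origins": ["https://app.example.com"],
}
fastly = {
    "min_tls_version": "1.2",
    "default_ttl_seconds": 3600,
    "compression": True,
}

for setting, (old, new) in diff_configs(cloudfront, fastly).items():
    print(f"{setting}: CloudFront={old!r} Fastly={new!r}")
```

A diff like this will not prove the two setups behave identically, but each mismatch it surfaces is a cheap hypothesis to test before the roll-out.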
Integrate and Test Early #
To mitigate risks from unknown unknowns, integrate early (to exercise new components) and schedule bug bashes. Prioritize early integration over completeness.
Example #
You are replacing a legacy orchestration framework with a new pubsub framework.
- At the earliest opportunity, migrate the smallest use case over to the new framework, even before it is fully complete. Collect feedback from early adopters.
- Over the next few days, invite stakeholders to log bugs. Triage and prioritize these.
- Use the input from above to inform the rest of your roadmap and rollout plan.
Confirm Alignment #
One of the best things you can do for your team is to specify goals in terms of business value, set constraints, then give the team the flexibility to change implementation details without requiring additional approval. In addition, use constraints to help everyone get on the same page regarding the degree of investment (size) and timeline.
Example #
You are leading a project to reduce the cost of AWS GuardDuty. An earlier effort has identified that unnecessary DNS queries account for 80% of the monthly charges.
- Good Goal: reduce the monthly cost of AWS GuardDuty by 70%, starting by removing unnecessary DNS queries.
- Bad Goal: implement DNS caching.
- Constraints
  - We intend to spend about 1 month on this project, and no more than 6 weeks. Beyond 6 weeks, the savings are no longer worth the time.
  - The optimization should result in net savings, i.e. it should not incur additional costs elsewhere.
  - The solution should be fully automated and require fewer than 4 hours of maintenance every year.
  - The solution should not cause performance regressions or reduce availability (empirical or theoretical).
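It is worth checking the arithmetic behind a goal like this before committing to it. The sketch below assumes that DNS queries make up 80% of the monthly charges, as the example describes, and that the rest of the bill stays unchanged. The function name is hypothetical.

```python
def required_cut_within_component(target_total_pct: float, component_share_pct: float) -> float:
    """Percent of one cost component that must be removed to hit a total-cost target.

    Assumes every other component of the bill stays the same.
    """
    if not 0 < component_share_pct <= 100:
        raise ValueError("component share must be in (0, 100]")
    required = 100 * target_total_pct / component_share_pct
    if required > 100:
        raise ValueError("target cannot be met from this component alone")
    return required


# Goal: cut the total bill by 70% when DNS queries are 80% of it.
print(required_cut_within_component(70, 80))  # 87.5
```

Under these assumptions, hitting the 70% goal through DNS queries alone means eliminating 87.5% of the DNS-related charges, which tells the team how aggressive the DNS work has to be and whether other savings need to be found as well.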
Identify Stakeholders #
In Identify Dependencies, I suggested finding 99% and 99.9% of the unknowns to turn them into knowns. You can address the remaining 1% by making sure the ICs (engineers, designers, analysts) are introduced to stakeholders and have a frictionless line of communication with them.
What’s a stakeholder? Someone who has a stake in the outcome. Perhaps they drove the Why that kicked off the project, contributed to the legacy implementation, have expertise in that domain, or might be affected by the roll-out.
Example #
You are leading a project to reduce the cost of AWS GuardDuty. An earlier effort has identified that unnecessary DNS queries account for 80% of the monthly charges.
Stakeholders include
- The person in charge of improving (or tracking) company-wide gross margins. This person might be a project manager, someone from Engineering Ops or Product Ops, or someone from Financial Planning & Analysis (FP&A).
- The CSMs affiliated with enterprise customers who might see a performance regression from this optimization.
- Engineers who are on-call during the roll-out.
- Sec Ops, or whoever reviews GuardDuty findings on some recurring basis.
Once you have identified your stakeholders, create a Slack channel and invite all of them. Let your stakeholders choose to mute or leave the channel.
Reduce WIP #
Reduce the number of in-flight projects.
- At the team level, encourage having multiple people on the same project, instead of having each person work independently on individual projects.
- At the individual level, encourage working on projects serially; discourage context-switching.
See Work in progress limits for more.
Control Scope (YAGNI) #
Creative individuals tend to get ambitious at the beginning of a project. As part of Plan Early and Look Ahead and Confirm Alignment, you should review the implementation plan and verify that the team is taking on the least amount of work necessary to achieve the objectives.
I am not recommending that you take on tech debt; I am recommending that you actively avoid tech debt by
- avoiding premature abstraction
- avoiding premature development of technology that comes with maintenance cost
- reducing branches and code paths that might be used but are not required for the objective
Martin Fowler explained this in YAGNI.
Compound: Context & Expertise #
It helps to sequence projects in ways that allow people on the same team to gather context and build up expertise.
Example #
You have 2 projects on your roadmap, both of which require using some sort of HTTP or TCP proxy.
- Project A: build an egress proxy so you can monitor outgoing traffic.
- Project B: build an ingress proxy so you can shape traffic and route requests to different pods.
Project A has less risk since it is only required for observability. If this proxy fails, the platform can fall back to a direct outgoing connection; moreover, if this proxy introduces additional latency, it can be disabled while debugging.
Project B has more risk since it is on the ingress (critical) path.
Do Project A before B, and use A to provide the team with an opportunity to learn about the proxy technology, and find all its shortcomings.
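The reason Project A is the lower-risk one can be made concrete with a small sketch of the fallback behavior described above. Everything here is an assumption for illustration: the function names, the `proxy_enabled` switch, and the choice of `ConnectionError` as the failure signal. The two transports are passed in as callables so the sketch stays independent of any particular proxy technology.

```python
from typing import Callable


def send_outgoing(
    request: str,
    via_proxy: Callable[[str], str],
    direct: Callable[[str], str],
    proxy_enabled: bool = True,
) -> str:
    """Send through the egress proxy when enabled, otherwise connect directly.

    If the proxy fails, fall back to a direct connection so that an
    observability-only component never blocks outgoing traffic.
    """
    if proxy_enabled:
        try:
            return via_proxy(request)
        except ConnectionError:
            pass  # proxy is down: fall through to the direct path
    return direct(request)


# Stand-in transports for illustration.
def broken_proxy(request: str) -> str:
    raise ConnectionError("proxy unavailable")


def direct_connection(request: str) -> str:
    return f"direct:{request}"


print(send_outgoing("GET /health", broken_proxy, direct_connection))
```

An ingress proxy has no equivalent escape hatch, since there is no direct path to fall back to once it sits in front of every request. That asymmetry is what makes Project A the safer place to learn.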
Compound: Process & Expectations #
Observe how your team is executing. If you see them asking the same question more than a few times, or re-discovering how to perform the same operation, it can help to write it down explicitly as a process or a standard. Remove the cognitive overhead from routine work.
Example #
You notice that 3 different engineers had to figure out how to run large backfills independently on 3 separate occasions. Each person had their own risk tolerance and took a slightly different approach to ensure that their backfill was safe and did not cause resource starvation on production. The last backfill, despite the author’s best intentions, contained a slight misconfiguration that was not caught in peer review and caused a 3-minute performance degradation. The author wrote an incident retrospective and documented some lessons learned.
As a leader, you can make sure that
- a runbook for running large backfills safely is drafted and reviewed
- all 3 engineers contribute to the runbook
- all engineers in the future know where to find this runbook
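A runbook like this often ends up with a small reference snippet next to the prose. The sketch below shows one common way to keep a backfill from starving production: process records in small batches and pause between them. It is an assumption about what such a runbook might contain, not the approach any of the three engineers took. The batch size and pause values are placeholders.

```python
import time
from typing import Callable, Sequence


def run_backfill(
    ids: Sequence[int],
    process_batch: Callable[[Sequence[int]], None],
    batch_size: int = 500,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process `ids` in batches, pausing between batches. Returns the count processed."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    processed = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        process_batch(batch)
        processed += len(batch)
        if start + batch_size < len(ids):
            sleep(pause_seconds)  # give production traffic room between batches
    return processed
```

Writing the defaults down once, in one reviewed place, is what removes the need for each engineer to rediscover their own safe values.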
Compound: Self-Service #
Every time your team releases a product feature or internal tool, your team will have to spend some time every month supporting that feature/tool. This can mean fixing bugs or simply answering questions. An engineer once told me, “I would only have to spend 10 minutes a week answering questions. It is not too bad.” A few quarters later, the same engineer complained about having trouble focusing because of too many interruptions in their day. Needless to say, they were wrong (but I did not let them know that I had told them so).
Build with self-service in mind. You know you have succeeded when your internal & external users can use the feature/tool without support. 10 minutes a week is infinitely more than 0 minutes.
Example #
Your team built an internal feature flag system. Make sure that
- a short How To guide is available
- everyone knows how to find the guide
- the guide includes clear & concise descriptions of every configuration option
- the guide includes simple examples of the most common use cases
- there are pointers to the guide everywhere, including error messages, other guides, the PRD template, PR templates, etc.
- there are at least 3 examples from early adopters in the codebase
- all names are simple, obvious, and self-explanatory
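Pointing to the guide from error messages is the easiest of these items to show in code. The following is a minimal sketch assuming a hypothetical in-memory flag store and a placeholder guide URL. None of these names come from a real feature flag library.

```python
GUIDE_URL = "https://wiki.example.com/feature-flags"  # hypothetical location of the How To guide


class UnknownFlagError(LookupError):
    """Raised when code asks for a flag that has not been defined."""


class FlagStore:
    def __init__(self, flags: dict):
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        if name not in self._flags:
            known = ", ".join(sorted(self._flags)) or "(none)"
            # The message answers the likely follow-up questions up front:
            # which flags exist, and where to read more.
            raise UnknownFlagError(
                f"Unknown feature flag {name!r}. Known flags: {known}. "
                f"See the How To guide: {GUIDE_URL}"
            )
        return self._flags[name]


flags = FlagStore({"new_checkout": True, "dark_mode": False})
print(flags.is_enabled("new_checkout"))  # True
```

An engineer who mistypes a flag name gets the list of valid names and a link to the guide, without having to ask anyone.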
Ask The Team: Where Are You Spending A Lot Of Time? #
There are often simple solutions and optimizations once these are surfaced. All of the following examples are from my personal management experience.
Examples #
- X is waiting for answers from Y, and they have been going back and forth over async channels such as PR review comments and Slack.
  - Solution: get them in a Zoom meeting or (even better) in a meeting room.
- X keeps waiting for answers from another team, and X finds themselves having to re-explain all of the context every time the on-call shift rotates on that team. Worse, X is getting conflicting or contradictory recommendations.
  - Solution: ask that team to provide a durable point-of-contact or establish a longer-term partnership.
- X is waiting for someone to approve their proposal.
  - Solution: the last time this happened, I realized that X only had a vague idea of the approval requirements. Help X by figuring out whose approval is needed and nudging them. (List the names of individuals.)
- X is spending all of their time deploying to staging and testing on staging.
  - Solution: help them figure out how to test locally. You (the leader) do not need to have the answers, but you have to prompt them to ask in some broad channel. I have found on several occasions that people have this misconception that they have to deploy to the cloud to use cloud services, forgetting that AWS clients can run from their laptops too.
- X is waiting on compliance approval for vendor procurement.
  - Solutions include:
    - Is this a new vendor? If not, there might have been a prior approval.
    - Is this vendor expected to be at the level of integration that requires extensive review? Levers include spend, data classification, etc. More than once, Legal & Compliance had assumed that the vendor would be used in production when we had only planned for it to be used in CI.
    - Is there another vendor you can use? For example, if you are already on Google Cloud Platform, GCP’s Cloud CDN would likely be pre-approved, so don’t go worrying about Fastly.
- X is stuck in analysis paralysis.
  - Solution: nudge X to plan an experiment, or implement a proof of concept. Get agile!
- X is spending time trying to get a precise and accurate measurement of the result they are optimizing.
  - Solution: ask them to reflect on the precision & accuracy needed for us to know that we have made a directional improvement.