An engineering team at a startup constantly needs to make decisions: choosing the right library for a task, deciding on a folder structure, standardising code, automating tasks, and so on. Traditionally, communication has been easy because teams were co-located, so the focus has been on making the decision itself. But what happens when you are a distributed or remote-first team? Worse, what happens if you are distributed both geographically and across different timezones?
This is the problem Up Learn faces. More accurately these are the constraints we are working with:
- Our engineering team is distributed across India, the UK, and Brazil, with the largest timezone difference being 10 hours. The team has about 2 hours of overlapping time every day. This time needs to be used for product development work as well as for engineering decision-making.
- We want the whole engineering team involved in making decisions. We are a very small team right now (7 engineers) and we feel that at this size involving everyone is the best way to arrive at good decisions and transfer knowledge.
- We want to make decisions quickly - most decisions should take 1-4 days with only some complex ones taking up to two weeks.
- We want to have a record of decisions taken so that we have a way to revisit them and potentially revise them based on new information.
By far the most popular decision-making process for distributed teams is the RFC process. It is mature, thorough, used widely, and well understood by many engineers. Here is a thorough guide on implementing an RFC process. And this blog post does a great job of conveying why such a process is needed and what might be the challenges when implementing it.
However, we realized that something was amiss. The RFC process is great at being inclusive, but it gets in the way of having a bias for action. When is it okay for people to just act instead of creating an RFC? The RFC process also doesn’t help when you don’t really have a proper proposal in mind and only understand the problem and have a few ideas for a solution. To address these limitations, we adopted a variant of the RFC process.
We call this the FAD process. The process allows us to make decisions asynchronously, while encouraging everyone to take responsibility for issues they discover. When you see a problem, you either know the solution or you don’t. If you don’t know the solution, you “discuss” the problem and possible solutions. If you do know the solution, then it either requires others to change their habits or it doesn’t. When it doesn’t, you just “inform” (FYI) others of the problem and the solution. When it does, you ask for “approval” to make sure everyone buys into the solution and changes their habits accordingly. The FYI, Approval, and Discussion options give the process its name, FAD. The flowchart below shows how the process works, and the sections below describe each of these options in greater detail.
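The decision path described above can be sketched as a small function. This is only an illustration, not part of the process itself: the names `Option` and `choose_option` are hypothetical, and the `needs_scheduling` flag is taken from the description of the Approval option later in this post (a solution that needs a non-trivial amount of work also goes to approval).

```python
from enum import Enum


class Option(Enum):
    """The three FAD options (hypothetical names, for illustration only)."""
    FYI = "FYI"
    APPROVAL = "Approval"
    DISCUSSION = "Discussion"


def choose_option(
    knows_solution: bool,
    changes_team_habits: bool = False,
    needs_scheduling: bool = False,
) -> Option:
    """Pick a FAD option for a problem someone has just noticed."""
    # No confident solution yet: open a discussion.
    if not knows_solution:
        return Option.DISCUSSION
    # A confident solution that affects how others work, or that is big
    # enough to need scheduling, goes to approval.
    if changes_team_habits or needs_scheduling:
        return Option.APPROVAL
    # Otherwise, implement it and let everyone know.
    return Option.FYI
```

For example, updating a library with no breaking changes would map to `choose_option(knows_solution=True)`, which returns `Option.FYI`.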
The FYI option applies when you see a problem, consider possible solutions, and are confident in one of them: you implement the solution and let everyone know about it. The decisions/solutions that fall into this category are usually automations and minor improvements that make the team’s lives easier. They usually don’t require teammates to change how they work. They may result in additional documentation, depending on the reason for the FYI. Examples of cases where one would be expected to take the FYI approach are:
- The version of a library your team is using has a new version available that is faster, and you know it has 0 breaking changes. You update the library version and let everyone know.
- A manual tech support task is painful to do, and verifying that it has been done correctly is stressful. You find a way to automate it. You implement the solution, document it in the runbook, and let everyone know about this new way of performing such tasks.
- You see that your team does not understand why a certain library is being used in a non-standard way. This knowledge is currently only in your head and is not documented anywhere. You document this in an appropriate place and let everyone know about the new documentation.
The Approval option applies when you see a problem, consider possible solutions, and are confident about one of them, but it either requires a non-trivial amount of work that needs to be scheduled or requires the team to change their habits. You share the problem, the solutions you considered, and the solution you propose with its pros and cons. The team reads through the proposed solution and challenges it if needed. You address the concerns raised or seek help to get them addressed. Once the decision (or a modification of the decision) is approved by at least two people, you create tickets for the actions required to implement the solution (including any documentation of processes). The Approval option is the closest to the RFC process. Examples of cases where one would be expected to take the Approval approach are:
- A new scaffolding tool has been added to the repository, but it is not being used by the team yet. You identify the reason for this as the team not knowing how to use it for different tasks (e.g. adding a new model, modifying a model, adding a new query, etc). You document the steps for using the tool for each of these tasks and share them for the team’s approval as the standardised way of doing things.
- You notice that a file is getting very big. You research and find that there is a standard way of splitting the file into appropriate contents for the language/framework, but the team has not followed that organization. You discuss with others who were involved in the decision originally and figure out that this was an oversight and there isn’t a good reason to have such big files. You propose the standard splitting solution as the new way of organizing the files. You also propose a way of refactoring existing files. You document the pros and cons of the new way.
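The approval rule above ("approved by at least two people") could be modelled roughly as follows. This is a sketch under assumptions: the `Proposal` class and its fields are hypothetical, and two details are not stated in this post, namely that the author's own approval does not count and that all raised concerns must be resolved before the proposal is considered approved.

```python
from dataclasses import dataclass, field

# From the text: a decision needs approval from at least two people.
MIN_APPROVALS = 2


@dataclass
class Proposal:
    """A hypothetical record of an Approval topic."""
    author: str
    approvers: set[str] = field(default_factory=set)
    open_concerns: int = 0  # concerns raised by the team and not yet addressed

    def approve(self, person: str) -> None:
        # Assumption: authors cannot approve their own proposal.
        if person != self.author:
            self.approvers.add(person)

    def is_approved(self) -> bool:
        # Assumption: unresolved concerns block approval.
        return len(self.approvers) >= MIN_APPROVALS and self.open_concerns == 0
```

Once `is_approved()` returns true, the author would create tickets for the implementation work, as described above.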
The Discussion option applies when you see a problem and consider solutions but are not happy with any of them: you share the problem and the solutions you could think of. The team proposes other solutions, and you arrive at a solution in consultation with the team. You are still responsible for driving a solution because “we take responsibility for issues”. Once there is enough context that you feel a decision is clear, you move the decision to the approval stage. Examples of cases where one would be expected to take the Discussion approach are:
- The way theme information is passed to React components is inconsistent. Some components directly import the theme, others accept it as a prop, yet others use a hook. You don’t have strong opinions about which of the options is better, but you believe inconsistency will slow the team down. You highlight the problem with pros/cons of each of the three methods and put the problem up for discussion.
- You do not have performance tracing ability in the backend. There are multiple libraries, multiple paid solutions, and multiple ways of deploying such a capability in your infrastructure. You are not even sure how best to think about this problem as you’ve never done this before and building an opinion would require a lot of effort. You are also not sure if this is a problem worth solving right now. You start a discussion about this, documenting all the thoughts you have right now including a list of good questions that, when answered, will allow the team to reach a resolution. Depending on the decision taken, you either create tickets for the next steps or close the discussion by summarizing the future plan.
Two Months Later
We implemented the above process using a GitLab project. Each topic is an issue, and labels are used to indicate whether a topic is FYI, Approval, or Discussion. The team leave comments on the issue to share their opinions, and the issue is closed once a topic concludes.
So which parts of the FAD process worked and which didn’t? Here’s a quick summary of the state 2 months later.
- 25 topics were raised in two months. Of those, 13 were closed and 12 are still being discussed. By my subjective assessment, the process is a success.
- Only 2 FYI topics were raised. So the process isn’t being used heavily to inform the team about improvements. There are a couple of reasons for this. First, we use Slack at work, which has been an easier tool for sending FYI messages than GitLab. Second, there are generally fewer FYI messages to give than approvals/discussions.
- Among the other 23 topics, 11 are approvals and 12 are discussions. This suggests there is a decent bias towards action (getting approval instead of opening a discussion). This also indicates that there are many situations where people don’t know the right solution, and the ability to have discussions (which is generally difficult to do in the RFC process) has been valuable.
- The oldest open topic is 2 months old. This indicates that a push is required to close topics. The RFC process often has a maximum open time after which the proposal is considered approved. A similar approach can be applied to close approvals quickly. However, for discussions we need a different solution. We have resorted to having a half-hour call every 2 weeks to look at the oldest discussions and try to bring them to a conclusion.
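The two closing mechanisms mentioned in the last point could be sketched like this. It is an illustration only: the `Topic` type and `triage` function are hypothetical, and the 14-day limit is an assumed value borrowed from the "up to two weeks" target stated earlier, not a rule we have adopted. The sketch does not call any GitLab API; it works on plain data.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed maximum open time for approvals (not a value stated in the post).
MAX_OPEN = timedelta(days=14)


@dataclass
class Topic:
    """A hypothetical open topic, mirroring an issue with a FAD label."""
    title: str
    kind: str  # "Approval" or "Discussion"
    opened: date


def triage(topics: list[Topic], today: date) -> tuple[list[Topic], list[Topic]]:
    """Split open topics into lapsed approvals and a call agenda.

    Returns (lapsed, agenda): approvals open longer than MAX_OPEN, which
    could be treated as approved, and discussions sorted oldest first for
    the fortnightly call.
    """
    lapsed = [
        t for t in topics
        if t.kind == "Approval" and today - t.opened > MAX_OPEN
    ]
    agenda = sorted(
        (t for t in topics if t.kind == "Discussion"),
        key=lambda t: t.opened,
    )
    return lapsed, agenda
```

Keeping the two lists separate reflects the point above: a timeout can close approvals, but discussions need a person to drive them to a conclusion.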