Define Services

  • Useful questions to ask as a cloud architect to help build the requirements are: Who? What? Why? When? And how?
    • The "who" is about determining not only the users of the system but also the developers and stakeholders. The aim is to build a full picture of who the system will affect, both directly and indirectly.
    • The “what” is both simple and difficult. We need to establish the main areas of functionality required, but in a clear, unambiguous manner.
    • “Why” the system is needed is a really important question. What problem does the proposed system aim to address or solve? Without a clear understanding of the need, it is likely that ‘extra’ requirements will be added. The why can also help in defining KPIs, SLOs, and SLAs.
    • “When” helps determine a realistic timeline and can help contain the scope.
    • “How” helps to determine many of the non-functional requirements, for instance how many users the system needs to support concurrently, the average payload size of service requests, and any latency requirements. It also covers where users are located: across the world or in a particular region only.
  • Roles represent the goal of a user at some point, and they enable the analysis of a requirement in a particular context. It is important to note that a role is not necessarily a person. It is an actor on the system and could be another system, such as a microservice client that is accessing another microservice. The role should describe the user’s objective when using the system. For example, the role of a shopper on an e-commerce application clearly defines what the user wants to do.
  • There are many ways to determine the roles for the requirement you are working on. One process that works particularly well is:
    • First, brainstorm an initial set of roles: Write as many roles as you can think of, with each role being a single user.
    • Now organize this initial set: Here you can identify overlapping roles and related roles and group these together.
    • With the set of roles now grouped, consolidate the roles: The aim here is to consolidate and condense the roles to remove duplication.
    • Finally, refine the roles, including internal and external roles, and the different usage patterns. Here extra information can be provided such as the user’s level of expertise in the domain, or the frequency of use of the proposed software.
  • Identifying user roles is a useful technique in the requirements-gathering process. An additional technique, particularly for the more important roles, is to create a persona for the role. A persona is an imaginary representation of a user role. The aim of the persona is to help the architect and developers think about the characteristics of users by personalizing them. Often a role has multiple personas.
  • Personas:
    We can think in terms of users of the system, and many requirements can be gathered this way. Using personas can provide further insights. For example, Jocelyn is a persona who is a busy working mom. Jocelyn wants to save time and money as well as perform the standard banking operations online and receive benefits such as cash back. Using a persona helps build a fuller picture of the requirements. For instance, Jocelyn’s wanting to save time indicates that the tasks to be performed should possibly be automated, which affects latency and service design.
  • User stories describe one thing a user wants the system to do. They are written in a structured way, typically using the form:
    As a [type of user], I want to [do something], so that I can [get some benefit].
    Another commonly used form is:
    Given [some context], when I [do something], then [this should happen].
  • User Stories:
    • When writing stories, give each story a title that describes its purpose as a starting point. Follow this with a concise one-sentence description that uses one of the standard story forms ("As a ... I want to ... so that ..." or "Given ... when ... then ..."). This form describes the user role, what they want to do, and why they want to do it. As an example, consider a banking system and a story to determine the available balance of a bank account.
    • The title of the story could be Balance Inquiry. Then, following the template, we describe the story: “As an account holder, I want to check my available balance at any time of day, so that I am sure not to overdraw my account.” This explains the role, what they want to do, and why they want to do it.
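The story template can be captured as a small data structure, which makes it easy to check that every story states a role, an action, and a benefit. This is only a minimal sketch: the `UserStory` class and its field names are hypothetical and not part of any tool mentioned in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    """One user story in the 'As a ... I want to ... so that ...' form."""

    title: str    # short name that describes the story's purpose
    role: str     # the type of user (not necessarily a person)
    action: str   # what the user wants to do
    benefit: str  # why they want to do it

    def render(self) -> str:
        """Return the one-sentence description of the story."""
        return f"As {self.role}, I want to {self.action}, so that {self.benefit}."


# The Balance Inquiry example from the notes.
balance_inquiry = UserStory(
    title="Balance Inquiry",
    role="an account holder",
    action="check my available balance at any time of day",
    benefit="I am sure not to overdraw my account",
)

print(f"{balance_inquiry.title}: {balance_inquiry.render()}")
```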
  • User stories provide a clear and simple way of agreeing to requirements with a customer/end user. The INVEST criteria can be used to evaluate good user stories. Let's go through each letter of these criteria.
    • Independent: A story should be independent to prevent problems with prioritization and planning.
    • Negotiable: They are not written contracts but are used to stimulate discussion between customer and developers until there is a clear agreement. They aid collaboration.
    • Valuable: Stories should provide value to users. Think about outcomes and impact, not outputs and deliverables.
    • Estimable: It must be possible to estimate the story. If it is not, this often indicates missing details or a story that is too large.
    • Small: Good stories should be small. This helps keep scope small and therefore less ambiguous and supports fast feedback from users.
    • Testable: Stories must be testable so that developers can verify that the story has been implemented correctly and validate when the requirement has been met/is done.
  • Here is an example persona:
    “Jocelyn is a busy working mom who wants to access MegaCorp Bank to check her account balances and make sure that there are enough funds to pay for her kids' music and sport lessons. She also uses the web site to automate payment of bills and see her credit account balances. Jocelyn wants to save time and money, and she wants a credit card that gives her cash back.”
  • Here is an example user story for a feature: Balance Inquiry
    “As a checking account holder, I want to check my available balance at any time of day, so that I am sure not to overdraw my account.”
  • Here are a couple of examples of personas for our online travel portal.
    • Karen is a busy businesswoman who likes to take luxury weekend breaks, often booked at the last minute. A typical booking comprises a hotel and flight. Recommendations play a major role in the choice Karen makes, as does customer feedback. Karen likes to perform all operations from her phone.
    • Andrew is a student who likes to travel home to visit parents and also takes vacations twice yearly. His primary concern is cost, and he will always book the lowest price travel regardless of convenience. Andrew has no loyalty and will use whichever retailer can provide the best deal.
  • Here are a couple of examples of user stories for our online travel portal.
    • For the “Search for Flight and Hotel” feature I could write: As a traveler, I want to search for a flight-hotel combination to a destination on dates of my choice, so that I can find the best price.
    • For the “Supply Hotel Inventory” feature I could write: As a hotel operator, I want to bulk supply hotel inventory, so that ClickTravel can sell it on my behalf.
  • To manage a service well, it is important to understand which behaviors matter and how to measure and evaluate these behaviors:
    • For user-facing systems: Was a request responded to? (availability) How long did it take to respond? (latency) How many requests can be handled? (throughput)
    • For data storage systems: How long does it take to read and write data? (latency) Is the data there when we need it? (availability) If there is a failure, do we lose any data? (durability)
  • Business decision makers want to measure the value of projects. This enables them to better support the most valuable projects and not waste resources on those that are not beneficial. A common way to measure success is to use KPIs. KPIs can be categorized as business KPIs and technical KPIs.
  • Business KPIs are a formal way of measuring what the business values, such as ROI, in relation to a project or service. Others include earnings before interest and taxes or impact on users such as customer churn, or maybe employee turnover.
    • Technical or software KPIs can consider aspects such as how effective the software is through page views, user registrations, and number of checkouts. These KPIs should also be closely aligned with business objectives.
  • To be the most effective, KPIs need an accompanying goal. This should be the starting point in defining KPIs. Then for each goal, define the KPIs that will allow you to monitor and measure progress. For each KPI, define targets for what success looks like. Monitoring KPIs against goals is important to achieving success and allows readjustment based on feedback.
    • As an example, a goal may be to increase turnover for an online store, and an associated KPI may be the percentage of conversions on the website.
  • A KPI is not the same thing as a goal or objective. The goal is the outcome or result you want to achieve. The KPI is a metric that indicates whether you are on track to achieve the goal.
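The distinction between a goal, a KPI, and a target can be sketched with the online store example. The goal is to increase turnover, the KPI is the conversion percentage, and the target says what success looks like. All figures below are hypothetical, as are the function names.

```python
def conversion_rate(orders: int, visits: int) -> float:
    """KPI: percentage of website visits that ended in an order."""
    if visits == 0:
        return 0.0
    return 100.0 * orders / visits


def on_track(kpi_value: float, target: float) -> bool:
    """A KPI is on track when it meets or exceeds its target."""
    return kpi_value >= target


# Hypothetical weekly figures and target for the online store.
weekly_visits = 20_000
weekly_orders = 500
target_pct = 3.0

rate = conversion_rate(weekly_orders, weekly_visits)
print(f"conversion rate: {rate:.2f}% (target {target_pct:.2f}%)")
print("on track" if on_track(rate, target_pct) else "below target, readjust")
```

With these numbers the KPI is 2.5%, which is below the 3% target, so the feedback would prompt a readjustment.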
  • KPIs should be SMART: specific, measurable, achievable, relevant, and time-bound.
    • A service level indicator (SLI) is a quantitative measure of some aspect of the level of service being provided. Examples include throughput, latency, and error rate.
    • A service level objective (SLO) is an agreed-upon target or range of values for a service level that is measured by an SLI. It is normally stated in the form SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. An example of an SLO is that the average latency of HTTP requests for our service should be less than 100 milliseconds.
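The two SLO forms (an upper target, or a lower and upper bound) can be expressed as a simple check against a measured SLI. This is a minimal sketch; the `SLO` class is hypothetical and the measured values are made up. It uses ≤ for the comparison, following the stated form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """An objective of the form SLI <= upper, or lower <= SLI <= upper."""

    name: str
    upper_bound: float
    lower_bound: Optional[float] = None  # None means there is no lower bound

    def is_met(self, sli_value: float) -> bool:
        """Return True when the measured SLI value satisfies the objective."""
        if self.lower_bound is not None and sli_value < self.lower_bound:
            return False
        return sli_value <= self.upper_bound


# The example from the notes: average HTTP latency with a 100 ms target.
latency_slo = SLO(name="average HTTP request latency (ms)", upper_bound=100.0)

for measured in (85.0, 120.0):  # hypothetical measurements
    verdict = "met" if latency_slo.is_met(measured) else "missed"
    print(f"{latency_slo.name}: {measured} -> {verdict}")
```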
  • A service level agreement (SLA) is an agreement between a service provider and a consumer. It defines responsibilities for delivering a service and the consequences when these responsibilities are not met. The SLA threshold is looser than the SLO: we architect the solution to maintain the agreed SLO, which leaves spare capacity before the SLA is breached.
  • Understanding what users want from a service will help inform the selection of indicators. The indicators must be measurable. For example, “fast response time” is not measurable, whereas “the percentage of HTTP GET requests that respond within 400ms, aggregated per minute” clearly is. Similarly, “highly available” is not measurable, but “the percentage of successful requests over all requests, aggregated per minute” is.
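The two measurable indicators described above can be computed directly from one minute of request records. This is a sketch under stated assumptions: the `Request` record is hypothetical, a request counts as successful when its status code is below 500, and a minute with no traffic is treated as meeting the indicator.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    method: str
    status: int
    latency_ms: float


def availability_sli(requests: Sequence[Request]) -> float:
    """Percentage of successful requests over all requests."""
    if not requests:
        return 100.0  # assumption: no traffic counts as fully available
    ok = sum(1 for r in requests if r.status < 500)
    return 100.0 * ok / len(requests)


def fast_get_sli(requests: Sequence[Request], threshold_ms: float = 400.0) -> float:
    """Percentage of HTTP GET requests that responded within the threshold."""
    gets = [r for r in requests if r.method == "GET"]
    if not gets:
        return 100.0  # assumption: no GET traffic counts as meeting the SLI
    fast = sum(1 for r in gets if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(gets)


# Hypothetical requests observed during one minute.
one_minute = [
    Request("GET", 200, 120.0),
    Request("GET", 200, 450.0),
    Request("GET", 200, 390.0),
    Request("GET", 200, 80.0),
    Request("POST", 500, 90.0),
]

print(f"availability: {availability_sli(one_minute):.1f}%")
print(f"GETs within 400 ms: {fast_get_sli(one_minute):.1f}%")
```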
  • Not only must indicators be measurable, but the way they are aggregated needs careful consideration. For example, consider requests per second to a service. How is the value calculated: from measurements taken once per second, or by averaging requests over a minute? The per-minute average may hide high request rates that occur in bursts of a few seconds.
  • Indicators Must Be MEASURABLE:
    • For example, consider a service that receives 1000 requests per second on even-numbered seconds and 0 requests on odd-numbered seconds. The average requests per second, reported over a minute, would be 500. However, the real load at times is twice as large as the average. Similarly, averages can mask the user experience when used for metrics like latency: they hide the requests that take much longer to respond than the average.
    • It is better to use percentiles for such metrics: a high-order percentile, such as the 99th, shows worst-case values, while the 50th percentile indicates a typical case.
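Both effects can be reproduced in a few lines. The first part rebuilds the even/odd-second example and compares the per-minute average with the peak. The second part compares the mean latency with the 50th and 99th percentiles, using the nearest-rank method (one of several percentile definitions, chosen here for simplicity). The latency sample is hypothetical.

```python
import math
from statistics import mean

# Part 1: 1000 requests on even-numbered seconds, 0 on odd-numbered seconds.
per_second = [1000 if second % 2 == 0 else 0 for second in range(60)]
avg_rps = mean(per_second)
peak_rps = max(per_second)
print(f"average over the minute: {avg_rps} req/s, peak: {peak_rps} req/s")


# Part 2: percentiles versus the mean for latency.
def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: most requests are fast, a few are very slow.
latencies_ms = [40.0] * 95 + [2000.0] * 5

mean_ms = mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean: {mean_ms} ms, p50: {p50_ms} ms, p99: {p99_ms} ms")
```

In this sample the mean (138 ms) describes neither the typical request (40 ms) nor the slow tail (2000 ms), which is why the percentiles are more informative.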
  • The relevancy of SLOs is vital. You want objectives that help or improve the user experience. It is easy to define SLOs based around what is easy to measure rather than what is useful. For clarity, SLOs should specify how they are measured and the conditions when they are valid.
    • Consider availability as measured with an uptime check over 10 seconds aggregated per minute. It is unrealistic as well as undesirable to have SLOs with a 100% target. Such a target results in expensive, overly conservative solutions that are still unlikely to reach the SLO. It is better to track the rate at which SLOs are missed and work to improve this. In many cases, 99% may be good enough availability and be far easier to achieve as well as engineer. It is also highly likely to be much more cost effective to run.
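The gap between targets becomes concrete when an availability percentage is converted into the downtime it permits. This is simple arithmetic; the 30-day period is an assumption made for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


for target in (99.0, 99.9, 99.99, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per 30 days")
```

A 99% target leaves 432 minutes (a little over 7 hours) per 30 days, while a 100% target leaves none, which is why it is so expensive to attempt.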
  • The use case also needs to be considered. For example, if an HTTP service for photo uploads requires 99% of uploads to complete within 100ms, aggregated per minute, this may be unrealistic or overkill if the majority of users are on mobile phones. In such a case, an SLO of 80% of uploads is much more achievable and good enough.
    • It is often ok to specify multiple SLOs. Consider the following: 99% of HTTP GET calls will complete in less than 100ms
  • This is a valid SLO but it may be the case that the shape of the performance curve is important. In this case, the SLO could be written as follows:
    • 90% of HTTP GET calls will complete in less than 50ms
    • 99% of HTTP GET calls will complete in less than 100ms
    • 99.9% of HTTP GET calls will complete in less than 500ms
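A multi-threshold SLO like this one can be evaluated by checking each (percentage, latency) pair against the observed latencies. This is a minimal sketch; the function names and both latency samples are hypothetical.

```python
from typing import List, Sequence, Tuple

# (required percentage of calls, latency threshold in ms), from the list above.
SLO_TARGETS: List[Tuple[float, float]] = [(90.0, 50.0), (99.0, 100.0), (99.9, 500.0)]


def share_faster_than(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of calls that completed in less than threshold_ms."""
    if not latencies_ms:
        return 100.0  # assumption: no traffic counts as meeting the target
    fast = sum(1 for value in latencies_ms if value < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def evaluate(latencies_ms: Sequence[float],
             targets: Sequence[Tuple[float, float]] = SLO_TARGETS) -> List[bool]:
    """Return one pass/fail result per target, in the order given."""
    return [share_faster_than(latencies_ms, threshold) >= pct
            for pct, threshold in targets]


# Hypothetical samples of 1000 GET calls each.
healthy = [30.0] * 920 + [80.0] * 75 + [300.0] * 5
degraded = [30.0] * 850 + [80.0] * 150

print("healthy: ", evaluate(healthy))
print("degraded:", evaluate(degraded))
```

The degraded sample still meets the 100ms and 500ms targets but misses the 50ms one, which is the kind of shift in the performance curve that a single-threshold SLO would not reveal.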
  • Selecting SLOs has both product and business implications. Often tradeoffs need to be made based on constraints such as staff, time to market, and funding. The aim is to keep users happy, not to have an SLO that requires heroic efforts to maintain.
  • SLOs:
    • Do not make them too high: It is better to start with lower SLOs and tighten them over time as you learn about the system than to define SLOs that are unattainable and require significant effort and cost to pursue.
    • Keep them simple: More complex SLIs can obscure important changes in performance.
    • Avoid absolute values: An SLO that states 100% availability is unrealistic. Such an SLO increases the time to build, the complexity, and the cost to operate, and in most cases is highly unlikely to be required.
    • Minimize SLOs: A common mistake is to have too many SLOs. The recommendation is to have just enough SLOs to give coverage of the key system attributes.
  • In summary, good SLOs should reflect what the users care about. They work as a forcing function for development teams. A poor SLO will result in a significant amount of wasted work if it is too ambitious, or a poor product if it is too relaxed.
    • An SLA is a business contract between the service provider and the customer. A penalty will apply if the service provider does not maintain the levels agreed on. Not every service has an SLA, but all services should have SLOs.
    • As with SLOs, it is better to be conservative with SLAs, because it is difficult to change or remove SLAs that offer little value or cause a large amount of work. In addition, because they can have financial implications through compensation to the customer, setting them too high can result in unnecessary compensation being paid. To provide protection and some level of safety, an SLA should always have a threshold that is lower than the SLO.
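The relationship between the two thresholds can be sketched as a three-state check for a "higher is better" indicator such as availability. The function name, the state labels, and the percentages are hypothetical. The point is only that missing the SLO gives an early warning before the SLA, and any penalty attached to it, comes into play.

```python
def service_status(measured_pct: float, slo_pct: float, sla_pct: float) -> str:
    """Classify a measured availability against an SLO and a lower SLA threshold."""
    if sla_pct >= slo_pct:
        # The notes require the SLA threshold to be lower than the SLO.
        raise ValueError("SLA threshold should be lower than the SLO")
    if measured_pct >= slo_pct:
        return "ok"
    if measured_pct >= sla_pct:
        return "slo_missed"  # internal objective missed, contract still met
    return "sla_breached"    # contractual level missed, penalty may apply


# Hypothetical thresholds: internal SLO of 99.9%, contractual SLA of 99.5%.
for measured in (99.95, 99.7, 99.2):
    print(f"{measured}% -> {service_status(measured, slo_pct=99.9, sla_pct=99.5)}")
```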