SRE Help

Posts

The Twelve-Factor App

The Twelve-Factor App (12-Factor App):
1. Codebase: One revision control system for all code. All code related to an application (e.g., source code, scripts, configuration settings) is stored in a source code repository. The repository should be accessible to development, testing, and system administration staff. It should also be accessible to all automation scripts that are part of the Continuous Integration/Continuous Delivery (CI/CD) processes in the enterprise's Software Development Lifecycle (SDLC).
2. Dependencies: Explicitly declare and isolate dependencies, and avoid storing artifacts alongside source code in the source code repository. Only code that is unique and relevant to the purpose of the application is stored in source control; any external artifacts/dependencies (e.g., node.js, java.jar) should be referenced in a dependencies manifest so that they can be accessed and loaded into memory at development, testing, and production runtime.
3. Config: Store Configurat...

Tools used by SREs

Tools used by Site Reliability Engineers (a partial list):
- Manage and Deploy: GitHub, Subversion.
- Container Orchestration: Kubernetes, Mesosphere.
- CI/CD: Jenkins, CircleCI.
- Infrastructure Automation and Application Configuration: Terraform and Ansible.
- System Health Monitoring, Security, and Configuration Management: ELK (Elasticsearch, Logstash, and Kibana), Datadog, HashiCorp Vault, New Relic, AWS Security Service, Puppet, etc.
- Project Management and Tracking Tools: Jira, Pivotal Tracker, etc.

Key Duties of SRE

Key Duties of SRE:
1. Develop and manage system documentation.
2. Design, build, and maintain the core infrastructure.
3. Diagnose the system and develop fixes.
4. Focus on writing software to automate processes and reduce toil.
5. Monitor performance.

SRE team responsibilities consist of:
- Availability
- Latency
- Performance
- Efficiency
- Change management
- Monitoring and alerting
- Emergency response
- Incident response
- Preparation
- Capacity planning

SRE’s strategic goals are to:
- Make deployment easier
- Improve or maintain uptime
- Establish visibility into application performance
- Set SLIs, SLOs, and Error Budgets
- Increase speed by taking calculated risks
- Eliminate toil
- Reduce the cost of failure to lower new feature cycle time

Reducing toil and controlling SRE team workloads: Toil is the tedious, manual work performed by SRE teams. In your day-to-day work, whether you are a software developer or an architect, you may be doing tasks that you do not enjoy. These are usually manual...

Error Budget

Error Budget: In short, an error budget is "some budget for failure": an acceptable level of failure that is defined in advance. The key to SRE is being able to balance the error budget against engineering time, development velocity, and money. The error budget is the gap between perfect reliability and our SLO. This is a budget to be spent. Given an uptime SLO of 99.9%, after a 20-minute outage you still have about 23 minutes of budget remaining for the month (about 43 minutes of total downtime a month).

Error Budget Policy: The Error Budget Policy is what you agree to do when the application exceeds its error budget. This is not "pay $$$"; it must be something that will visibly improve reliability.

Error Budget Policy Examples: Until the application is again meeting its SLO and has some Error Budget:
- No new feature launches are allowed.
- Sprint planning may only pull Postmortem Action Items from the backlog.
- The Software Development Team must meet with the SRE team daily to ou...
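The arithmetic behind the 99.9% example can be sketched in a few lines of Python. This is only an illustration: the function names, the 30-day month, and the rule that launches are blocked once the budget is used up are assumptions made for the sketch, not part of any standard tool or policy.

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Total downtime, in minutes, that an uptime SLO allows over the period.

    Assumes a 30-day month by default (hypothetical choice for this sketch).
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(slo_percent: float, downtime_minutes: float,
                             period_days: int = 30) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo_percent, period_days) - downtime_minutes


def feature_launches_allowed(slo_percent: float, downtime_minutes: float,
                             period_days: int = 30) -> bool:
    """Hypothetical policy check: launches are allowed only while budget remains."""
    return remaining_budget_minutes(slo_percent, downtime_minutes, period_days) > 0


if __name__ == "__main__":
    # Sample uptime references for a few common SLO targets.
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% uptime -> {error_budget_minutes(slo):.1f} minutes of budget per month")

    # The example from the text: 99.9% SLO, one 20-minute outage.
    print(f"Remaining: {remaining_budget_minutes(99.9, 20):.1f} minutes")
    print(f"Launches allowed: {feature_launches_allowed(99.9, 20)}")
```

With a 99.9% SLO the budget works out to roughly 43.2 minutes for a 30-day month, so a 20-minute outage leaves about 23 minutes, which matches the figures above.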

SLI / SLO / SLA

What is a Service Level Indicator (SLI)? A Service Level Indicator (SLI) is a direct measurement of a service’s behavior, defined as the frequency of successful probes of our system. When we evaluate whether our system has been running within SLO for the past week, we look at the SLI.

What is a Service Level Objective (SLO)? A Service Level Objective sets a goal for how well the system should behave, specifically by tracking customer experience. If customers are happy, then the SLO is being met. Typical SLO examples:
- Uptime of 99.9% a month (i.e., about 43 minutes of downtime a month).
- 99.99% of HTTP requests in a month succeed with a 200 OK.
- 50% of HTTP requests return in under 300ms.
- 99% of log entries are processed in under 5 minutes.

What is a Service Level Agreement (SLA)? A Service Level Agreement is made between the client and the service provider, and typically consists of guarantees with penalties for not meeting them. A system can still be within its SLA while the customer feels very unhappy with the experience.
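As a rough illustration of how an SLI is compared against an SLO, the Python sketch below computes two indicators from raw counts: the share of successful requests and the share of requests faster than a latency threshold. The function names and the sample numbers are hypothetical and chosen only to mirror the SLO examples above.

```python
from typing import Sequence


def availability_sli(successful: int, total: int) -> float:
    """Percentage of requests (or probes) that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 999,950 returned 200 OK.
    sli = availability_sli(999_950, 1_000_000)
    print(f"Availability SLI: {sli:.3f}% -> meets 99.99% SLO: {meets_slo(sli, 99.99)}")

    # Hypothetical latency samples checked against "50% under 300ms".
    samples = [120.0, 250.0, 310.0, 900.0]
    fast_share = latency_sli(samples, 300.0)
    print(f"Latency SLI: {fast_share:.1f}% -> meets 50% SLO: {meets_slo(fast_share, 50.0)}")
```

Keeping the SLI calculation separate from the SLO comparison means the same measurement can be checked against an internal SLO and a looser contractual SLA.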

Site Reliability Principles

The Site Reliability Principles:
1. SRE needs Service Level Objectives, with consequences.
2. SREs have time to make tomorrow better than today.
3. SRE teams have the ability to regulate their workload.

SRE needs Service Level Objectives, with consequences: Any organization, even without hiring a single SRE, can have an Error Budget Policy. This is any lever you can use to keep your customers from experiencing pain when using your application. You can implement this today: measure, account, and act.

Making Tomorrow Better than Today: SLOs and Error Budgets are the first step. The next step is staffing an SRE role. Know the real responsibility.

Your First SRE: Defining and refining the Service Level Objectives. Best placed person to see that the Error Budget Policy is enacted when necessary. Is now responsible for making sure that the application meets the reliability expectations of its users. Another major part of the SRE role is toil, on-call, and incident management. Authoring and iterating...

What is SRE

What is SRE? Site Reliability Engineering originated at Google in 2003. It is a framework for operating large-scale systems reliably. SRE owns running the system in production at an operational level. Fundamentally, it's what happens when you ask a software engineer to design an operations function. So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise who have the ability to substitute automation for human labor.

What are the day-to-day work and responsibilities of an SRE team? In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, on-call support, and capacity planning. Many operations teams today have a similar role, but the way that SRE does it is quite different.

What is CRE? Customer Reliability Engineering is the concept of taking SRE principles and helping cloud customers operate reliably at scale...

DevOps vs SRE

1. Reduce Organizational Silos: Without DevOps/SRE, the development team focuses on implementing new features in the application, while the operations team focuses on the stability and reliability of the application. With DevOps/SRE, there is "Shared Ownership": the development and operations teams discuss and decide together on the product release and its stability.
2. Accept Failure as Normal: Without DevOps/SRE, teams blame each other. With DevOps/SRE, the priority is to restore the service, analyze the root cause, and avoid a recurrence of the same issue. Hence the practice of Blameless Postmortems.
3. Implement Gradual Change: Without DevOps/SRE, the development team releases new features with the product all at once, which can cause issues for operations. With DevOps/SRE, teams might follow "Canary Deployment" (and other techniques) so that any issue that occurs has limited impact.
4. Tooling and Automation: Without ...