Hi, my name is Roy Ginting. I am an IT operations (ops) and infrastructure (infra) guy. I am inspired to bring software engineering practices to managing ops and infra. Below are my views on several topics related to operations and infrastructure.
Besides IT ops and infra, I like doing meditation, running, and scuba diving.
On Cloud Computing
I enjoy setting up and operating systems on cloud computing platforms. My first interaction with cloud computing was in October 2012, for work. I had just graduated from college, and my first job required me to learn AWS. Since then, I have set up a Software as a Service (SaaS) system on AWS from scratch multiple times: purchasing a domain name, setting up HTTPS, provisioning the infrastructure, deploying the application, and operating both the infrastructure and the application. Besides AWS, I have also learned other cloud providers such as GCP, Alibaba, and Azure on the job. Learning multiple cloud providers is a good way to broaden your knowledge.
Choosing a specific cloud provider for your system can be a daunting task, especially when the multi cloud buzz is so loud. Each cloud provider has its own strengths and weaknesses. Multi cloud has become an umbrella term whose meaning nobody agrees on. Here are a few scenarios that fall under the multi cloud umbrella:
- Some say you are truly multi cloud if you deploy the same workload to multiple cloud providers. A common example is having a primary system on AWS and an active disaster recovery site in GCP.
- Another example, with a slightly different definition of multi cloud, is a single system with two different kinds of workloads. Let's say system A consists of workloads AA and AB. You deploy workload AA to one cloud provider and workload AB to another. A common example of this scenario is hosting your transaction system on AWS and using GCP for data analytics on that transaction system.
- The most lax definition of multi cloud is when your system is deployed on one cloud provider but uses an API from another. For example, your transaction system is hosted on AWS and it uses Google Cloud Vision AI.
Don't let multi cloud confuse you. Here are some recommendations to help you choose a cloud provider.
- If you have an existing partnership with a cloud provider or its subsidiary, leverage that relationship. Let's say you have a partnership with Microsoft because you use MS Office. Choosing Azure will be beneficial in this scenario.
- If you happen to get a large cloud credit from a certain provider, use that provider.
- Several cloud providers have startup acceleration programs. If you are a startup planning to join a startup program from a certain cloud provider, you had better use their cloud platform.
- If you have a geolocation requirement for your system and its data, choose a cloud provider that has a data center in the required region. For example, at the time of writing, Alibaba Cloud is the only one of the big 4 cloud providers (AWS, Azure, GCP, and Alibaba) available in Indonesia.
- Try contacting a cloud provider representative and asking for training and cloud credit to try out their platform. You should consider using the platform of whichever provider grants this request.
- The catch-all option is to use the market leader, which is AWS.
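The recommendations above can be read as a simple decision procedure. Here is a minimal sketch of that reading. It assumes the bullets are tried in the order listed and that a region requirement rules providers out rather than ranking them; the text states neither. The `Situation` class, its field names, and `recommend_provider` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Situation:
    # Hypothetical fields; each one maps to one recommendation in the list.
    partner: Optional[str] = None            # provider you already have a partnership with
    big_credit: Optional[str] = None         # provider that gave you a large cloud credit
    startup_program: Optional[str] = None    # provider whose startup program you plan to join
    granted_trial: Optional[str] = None      # provider that granted training and trial credit
    allowed_in_region: Optional[set] = None  # providers present in the required region, if any


def recommend_provider(s: Situation, catch_all: str = "AWS") -> Optional[str]:
    # Assumption: preferences are tried in the order the recommendations are listed.
    preferences = [s.partner, s.big_credit, s.startup_program, s.granted_trial, catch_all]
    for provider in preferences:
        if provider is None:
            continue
        # Assumption: a region requirement excludes providers instead of ranking them.
        if s.allowed_in_region is not None and provider not in s.allowed_in_region:
            continue
        return provider
    # No preferred provider is in the region: fall back to any provider that is.
    if s.allowed_in_region:
        return sorted(s.allowed_in_region)[0]
    return None


# Example taken from the text: an Indonesia requirement leaves only Alibaba.
choice = recommend_provider(Situation(partner="Azure", allowed_in_region={"Alibaba"}))
print(choice)
```

In practice these factors are weighed against each other and not checked mechanically, so treat the ordering as a starting point for the conversation, not a rule.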
On Site Reliability Engineering (SRE)
Production readiness is usually neglected at the beginning of system development because the team optimizes for development speed. The development team doesn't factor in the operability of the software under development because it is far from their main concern. They don't see the benefit, especially on a greenfield software development team. User requirements are ambiguous and keep changing while the timeframe doesn't change much. The development team cuts corners under time pressure. Features that help operate the software get discounted until the mess bites them. The system becomes less robust. People are afraid to make changes because they may break the system. The performance of the system degrades to the point where users complain and start leaving for a competitor. It's clear that the system needs to improve its reliability. The need to automate some tasks to make room for other things surfaces. But the more sophisticated we make that automation, the more we become dependent on highly skilled human operators. That is why I believe Site Reliability Engineering (https://landing.google.com/sre/sre-book/toc/index.html) is one of the appropriate models for operations.
On Infrastructure as Code (IaC)
System administrators have been using shell scripts to automate tasks for ages. They write instructions in a shell script, execute the script, and wait for its completion. So how is current automation different? As time has progressed, storage and computing have become cheaper. This lets us store more data economically and makes computing more accessible. One trend enabled by this change is Immutable Infrastructure. In Immutable Infrastructure, we bake all components of a server into an image and deploy that image. Once the image is created, it is never changed in place. We create a new image if we want to update a specific component. We deploy a new server from the image, and decommission the old server once the new server is ready to take over. Immutable Infrastructure increases the consistency and reliability of our infrastructure. It also makes the deployment process simpler and more predictable. Large image storage, a fast network for image transfer, and fast computing for baking images are the backbone that makes Immutable Infrastructure feasible.
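The bake, deploy, and decommission cycle can be sketched in a few lines. This is a toy in-memory model, not a real tool: `Image`, `Server`, `bake_image`, `roll_out`, and the component versions are all hypothetical names and values chosen for illustration. A real setup would use an image builder and a cloud API instead.

```python
from dataclasses import dataclass
from itertools import count

_image_ids = count(1)  # hypothetical source of unique image ids


@dataclass(frozen=True)  # frozen: an image can never be changed after it is baked
class Image:
    image_id: int
    components: tuple  # sorted (name, version) pairs


def bake_image(components: dict) -> Image:
    """Bake every server component into a new, read-only image."""
    return Image(next(_image_ids), tuple(sorted(components.items())))


@dataclass
class Server:
    name: str
    image: Image
    in_service: bool = False


def roll_out(old: Server, new_components: dict, is_healthy=lambda server: True) -> Server:
    """Replace `old` with a server built from a freshly baked image."""
    new_image = bake_image(new_components)  # an update always means a new image
    new = Server(name=f"{old.name}-img{new_image.image_id}", image=new_image)
    if not is_healthy(new):
        return old  # the new server is not ready, so the old one keeps serving
    new.in_service = True
    old.in_service = False  # decommission the old server only after the takeover
    return new


v1 = Server("web", bake_image({"app": "1.0", "nginx": "1.24"}), in_service=True)
v2 = roll_out(v1, {"app": "1.1", "nginx": "1.24"})
print(v2.name, v2.in_service, v1.in_service)
```

Note that updating the app never touches `v1.image`. The old image stays exactly as it was baked, which is what makes a rollback a matter of deploying it again.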
In the old way, a sysadmin specifies the instructions that need to be executed. The sysadmin has to track the state of the system and figure out the series of commands to run. This paradigm is known as the procedural or imperative approach to automation. Another approach, more suitable now that storage and computing power are more accessible, is the declarative approach. In this approach, the ops engineer specifies the intended state of the infrastructure. A tool is responsible for tracking the state of the infrastructure, and it deduces the series of instructions that need to be executed to bring the system to the intended state. The declarative approach reduces the cognitive load on the ops engineer since the tool does tasks previously done by the engineer. In the imperative approach, a sysadmin writes instructions into a file. In the declarative approach, an ops engineer specifies the intended state of a system in a file. In both approaches, the engineer needs a mechanism to manage those files. One common solution for managing file changes is a version control system. The combination of using a declarative approach to define infrastructure, managing the definition files in a version control system, and using an intelligent tool to deduce the instructions that need to be executed is known as Infrastructure as Code (IaC).
|Aspect|Who is responsible? Imperative (old)|Who is responsible? Declarative (new)|
|---|---|---|
|Tracking state of execution environment|Engineer|Tool|
|Describing intended state|Engineer|Engineer|
|Figuring out series of instructions required|Engineer|Tool|
|Executing the series of instructions|Engineer|Tool|
|Verifying result of automation|Engineer|Engineer|
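To make the "Tool" column concrete, here is a minimal sketch of what a declarative tool does: it compares the tracked state with the intended state, deduces the steps, and executes them. The `plan` and `apply` functions and the resource names are hypothetical. This shows the general idea only and is not how any particular tool is implemented.

```python
def plan(current: dict, desired: dict) -> list:
    """Deduce the steps that bring `current` to `desired` (the tool's job)."""
    steps = []
    for name in sorted(desired.keys() - current.keys()):
        steps.append(("create", name, desired[name]))
    for name in sorted(desired.keys() & current.keys()):
        if current[name] != desired[name]:
            steps.append(("update", name, desired[name]))
    for name in sorted(current.keys() - desired.keys()):
        steps.append(("delete", name, None))
    return steps


def apply(current: dict, steps: list) -> dict:
    """Execute the deduced steps and return the resulting state."""
    result = dict(current)
    for action, name, spec in steps:
        if action == "delete":
            result.pop(name)
        else:  # "create" and "update" both set the resource to its intended spec
            result[name] = spec
    return result


# The engineer only describes the intended state (hypothetical resources).
tracked = {"web": {"size": "small"}, "old-db": {"size": "large"}}
intended = {"web": {"size": "medium"}, "cache": {"size": "small"}}

steps = plan(tracked, intended)
for step in steps:
    print(step)
print(apply(tracked, steps) == intended)
```

Running `plan` again on a system that already matches the intended state yields no steps. That is the practical benefit over a shell script, which would repeat all of its commands regardless of the current state.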
My go-to tools for managing infrastructure are Terraform, Ansible, and Packer. Here are recommended sources to learn about the above tools and practices.
On Continuous Integration / Continuous Delivery (CI/CD)
There is a natural silo between the development team and the operations team. It stems from the different incentives of development and operations. The main incentive for the development team is to deliver new features faster. From a business perspective, delivering new features is great for retaining and attracting users. Dev team members get a dopamine kick when they see that the feature they delivered is valuable to users. The ops team gets its incentive from keeping the production system reliable. The chief threat to system reliability is change to the production system. If this friction isn't managed properly, it creates a wall of confusion between the dev and ops teams, each complaining that the other is jeopardizing its success. This imaginary wall needs to be torn down and replaced with a bridge. Continuous Integration / Continuous Delivery (CI/CD) is a proven approach to building that bridge.
Here are some resources to learn more about CI/CD:
- Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation
- Continuous Delivery website
I have extensive experience building a CI/CD platform used company-wide. I built the platform using Jenkins 2 with declarative pipelines, containerized the build process using Docker, and used Jenkins distributed builds to scale the builds across multiple VMs in different clouds. Distributed builds are a must if you build a platform used by many engineers. In my experience, two challenges are the hardest when creating a company-wide CI/CD platform. The second hardest is scaling the system under unpredictable build load while keeping the infrastructure cost-efficient. The hardest is meeting engineers' needs for a reliable build and deployment system, fast build times, and room for evolving technology needs.
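The scaling trade-off between build load and infrastructure cost can be illustrated with a small sizing rule. This is a minimal sketch under my own assumptions, not the logic of the platform described above and not a Jenkins API: `desired_agents`, its parameters, and the default limits are hypothetical.

```python
import math


def desired_agents(
    queued_builds: int,
    running_builds: int,
    executors_per_agent: int = 2,  # assumption: each build VM runs this many builds at once
    min_agents: int = 1,           # floor, so a build can start without waiting for a VM
    max_agents: int = 20,          # ceiling, so a burst of builds cannot blow up the cost
) -> int:
    """Return how many build agents to run for the current load."""
    if executors_per_agent < 1:
        raise ValueError("executors_per_agent must be at least 1")
    demand = queued_builds + running_builds
    needed = math.ceil(demand / executors_per_agent)
    return max(min_agents, min(max_agents, needed))


print(desired_agents(queued_builds=5, running_builds=2))
```

The floor and the ceiling are where the two goals meet: the floor keeps builds responsive when the load is low, and the ceiling caps spending when the load spikes.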
DevOps is a term that nobody really knows the meaning of anymore. Some organizations have it as a role. Some vendors sell solutions to DevOps problems that you may or may not have. Some institutions sell DevOps certifications to people who need attestation that they are doing DevOps. Some companies ride the DevOps wave as a marketing gimmick. Everybody has their own definition of and take on DevOps.
My take on DevOps is that it's a philosophy for tearing down the imaginary wall separating the development and ops teams. Both teams are working to deliver business value to users. I like to keep DevOps as a philosophy because it's useful as guidance in decision making. Principles help justify the practices we commit to, the processes we employ, and the tooling we choose.
Here are the 3 frameworks that see DevOps as principles: