Companies often migrate to the cloud for operational excellence and cost efficiency, only to find that costs spiral out of control in no time. The usual reason is a lack of governance and operational control: developers are left with seemingly unlimited resources and options.
Based on our many years of experience with cloud platforms such as AWS, Azure, GCP and OCI, the tips below apply regardless of the cloud platform and services used. Please feel free to comment with any other useful tips you have.
Architect for optimal efficiency
Rightsizing
Elasticity is one of the main advantages of the cloud: your systems can be scaled up or down dynamically according to load. However, the systems have to be architected to take advantage of it. Automatic scaling and load balancing need to be implemented, and auto-scale rules with proper thresholds need to be configured.
Have proper thresholds and monitoring
Automation goes hand in hand with the cloud. Automatic scaling is driven by the thresholds you configure, so monitoring usage and identifying proper thresholds are important steps in ensuring that your systems scale to the right size at the right time. For resources that do not (or need not) scale automatically, it is important to monitor usage periodically and resize them accordingly.
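As a minimal sketch of how configured thresholds drive a scaling decision, the example below adds or removes one instance depending on average CPU usage. The ScalingRule class, its threshold values and the instance limits are illustrative assumptions and do not correspond to any particular cloud platform's API.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    """Hypothetical auto-scale rule; all values are illustrative."""
    scale_out_cpu: float = 70.0  # add an instance above this average CPU %
    scale_in_cpu: float = 30.0   # remove an instance below this average CPU %
    min_instances: int = 2
    max_instances: int = 10


def desired_instances(current: int, avg_cpu: float, rule: ScalingRule) -> int:
    """Return the instance count the rule asks for, clamped to its limits."""
    if avg_cpu > rule.scale_out_cpu:
        target = current + 1
    elif avg_cpu < rule.scale_in_cpu:
        target = current - 1
    else:
        target = current
    return max(rule.min_instances, min(rule.max_instances, target))
```

The gap between the two thresholds keeps the system from scaling out and in repeatedly when load hovers around a single value. The right numbers have to come from monitoring real usage.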
Use only the required components without over-complicating
With so many tools and services available in the cloud, many developers over-complicate cloud applications by trying to use everything at their disposal. This makes the architecture overly complex and the systems less reliable and more expensive.
IaaS vs PaaS vs SaaS
Each model has a different implementation cost, so decide carefully which components go to IaaS, PaaS or SaaS based on business and technical requirements such as the uptime required, ease of management, resources available and level of control needed.
Going all to cloud vs hybrid model
If you have already invested in infrastructure and a subsystem does not need or benefit from the features of the cloud, it might be best to keep that subsystem in your local data centers and move only the required components to the cloud.
Power-scheduling resources
Cloud platforms provide an option to power down resources when they are not in use (e.g. at night). Depending on the requirements of the system, some components can be scheduled to power down and power up automatically at set times in order to save money.
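The scheduling decision itself can be sketched as follows. The function name, the 07:00 to 19:00 window and the weekdays-only default are illustrative assumptions. An actual scheduler would call the platform's own start and stop operations based on this result.

```python
from datetime import datetime, time


def should_be_running(now: datetime,
                      start: time = time(7, 0),
                      stop: time = time(19, 0),
                      weekdays_only: bool = True) -> bool:
    """Return True if a resource should be powered on at `now`."""
    if weekdays_only and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start <= now.time() < stop
```

With these assumed defaults, a resource runs 60 of the 168 hours in a week, which is where the saving comes from on resources billed by running time.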
Use the cost calculators effectively to predict the budget
Cloud platforms provide cost estimators/calculators to calculate the TCO (total cost of ownership) of a solution upfront. Using them while designing a system allows budget-based decisions to be made in the early phases and shows whether the technical design is financially viable.
Use automation
Automation is paramount in the cloud to manage the infrastructure at scale. Use various automation mechanisms provided by each platform to automate tasks such as power scheduling, data retention, tagging, etc.
Data retention
With seemingly unlimited resources available in the cloud, data can easily accumulate unnoticed and lead to hefty bills. So, whether it is a data lake or a file store, ensure that proper retention rules are configured and that enforcing them is fully automated.
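A retention rule boils down to comparing each object's age with a cutoff. The sketch below assumes a hypothetical inventory that maps object names to last-modified timestamps. In practice, the platform's built-in lifecycle or retention policies would normally enforce the rule.

```python
from datetime import datetime, timedelta


def expired_objects(objects: dict[str, datetime],
                    retention_days: int,
                    now: datetime) -> list[str]:
    """Return names of objects last modified before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, modified in objects.items() if modified < cutoff)
```

Passing `now` in explicitly keeps the function easy to test and makes a dry run possible before anything is deleted.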
Proper service level for each service
Some cloud services come in different service levels that vary in redundancy, performance and other features. Make sure you use only the service level you actually need. For example, disk storage might be offered as HDD, standard SSD, and SSD with higher IOPS. Likewise, cheaper archive storage might be available for backups while you are keeping very old backups on hot storage. The same applies to redundancy requirements.
Understand the spend
Analyze the monthly bill using the free tools provided by the cloud platform
Forecast the budget based on current spending and usage patterns
Investigate billing anomalies and fix the underlying issues on a regular basis
Reconcile invoices: compare monthly usage patterns to spot trends and correlate them with the changes made and releases done
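As a rough illustration of the anomaly check in the list above, the sketch below flags any month whose spend grew by more than a chosen percentage over the previous month. The 20% default and the input format (a plain list of monthly totals) are assumptions. The platforms' own billing tools offer more capable detection.

```python
def billing_anomalies(monthly_spend: list[float],
                      max_increase: float = 0.2) -> list[int]:
    """Return indexes of months whose spend grew more than max_increase
    (as a fraction) over the previous month."""
    flagged = []
    for i in range(1, len(monthly_spend)):
        previous, current = monthly_spend[i - 1], monthly_spend[i]
        if previous > 0 and (current - previous) / previous > max_increase:
            flagged.append(i)
    return flagged
```

A flagged month is only a prompt to investigate. Correlating it with the releases and changes made in that month is what identifies the underlying issue.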
Organize resources in proper organizational groups
Tag resources to identify the products, features, and resources that are costing more.
Proper tagging allows costs to be segmented and assigned to the organizational groups that own them in a meaningful way. Automatic tagging is important to ensure that everything is tagged and that you are not left with a large chunk of unattributed spending.
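To illustrate how tags make costs attributable, the sketch below totals monthly cost per tag value and collects everything without the tag under "untagged". The resource records, the "team" tag key and the field names are hypothetical and do not follow any platform's billing export format.

```python
from collections import defaultdict


def cost_by_tag(resources: list[dict], tag_key: str = "team") -> dict[str, float]:
    """Sum monthly cost per value of tag_key; resources missing the tag
    are grouped under 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)
```

The size of the "untagged" bucket is a useful measure of how well tagging is enforced.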
Third-party tools
We’re not going to recommend any particular tool, since we use multiple tools depending on the client and the requirements. But most of these tools have very good reporting dashboards, support multi-cloud environments, perform best-practice analysis and provide suggestions, and offer an auto-pilot mode that drives cost/performance optimizations according to pre-configured rules. (Use auto-pilot with caution and make sure the rules are correct! Test, test, test!)
To hear Apptio, a category leader in Technology Business Management (TBM) solutions, discuss how their solutions help clients save money, see the video here: https://www.cms.lk/cloud-cost-optimization-with-apptio/
Purchase the right type of resource for the right purpose
There are several purchasing models, such as free tiers, reserved instances, spot instances and pay-as-you-go (on-demand). A free tier might be the best option for individual R&D and learning work. Big data, CI/CD and containerized workloads might benefit from spot instances instead of on-demand or reserved instances. Long-term, predictable workloads might be best run on reserved instances.
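The choice between pay-as-you-go and a reservation comes down to how many hours the workload actually runs. The sketch below compares the two under simplified assumptions: a flat monthly reservation fee and a single on-demand hourly rate. The prices in the usage note are made up for illustration and are not taken from any provider.

```python
def break_even_hours(on_demand_hourly: float, reserved_monthly: float) -> float:
    """Hours per month above which the reservation becomes cheaper."""
    return reserved_monthly / on_demand_hourly


def cheaper_option(hours_used_per_month: float,
                   on_demand_hourly: float,
                   reserved_monthly: float) -> str:
    """Compare pay-as-you-go cost for the hours used with a flat reserved fee."""
    on_demand_cost = hours_used_per_month * on_demand_hourly
    return "reserved" if reserved_monthly < on_demand_cost else "on-demand"
```

With an assumed on-demand rate of 0.125 per hour and a reservation of 50 per month, the break-even point is 400 hours. A workload running around the clock (about 720 hours) favors the reservation, while one running 200 hours a month does not.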
Separation between R&D, QA, Pre-Prod and Production environments
A proper separation between work environments, at a logical level or even at an account level, makes it easier to ensure that service levels, redundancy and so on are applied only where they are required.
Consistent monitoring
Track unused resources (unused IPs, VMs, disks and storage, and load balancers) to ensure that you don’t continue to pay for them. Keeping tags up to date and monitoring resource usage are important for this.
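A periodic check for unused resources can be as simple as filtering an inventory. The sketch below assumes hypothetical inventory records with "attached" and "days_idle" fields, and a 30-day idle limit. Real inventories would come from the platform's monitoring and resource APIs.

```python
def unused_resources(resources: list[dict], idle_days: int = 30) -> list[str]:
    """Return names of resources that are unattached, or idle for at least
    idle_days days."""
    return [
        r["name"] for r in resources
        if not r.get("attached", True) or r.get("days_idle", 0) >= idle_days
    ]
```

The output is best treated as a review list for the resource owners, identified through tags, rather than as a list to delete automatically.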
Governance
Access control
If everyone can create resources wherever and whenever they want, the organization is likely to end up with hefty bills at some point and will need to spend a lot of effort right-sizing later. It is important to grant each team or role only the appropriate level of access to each environment.
Implement proper change management and CAB approval processes
Ensure that configuration changes to your cloud infrastructure go through a change advisory board (CAB) or whatever change management approval process your organization has.
Implement proper quota limits to prevent unplanned spending
Set quota limits on the resources that can be created, so that there is less chance of someone (or some script) mistakenly spinning up a lot of resources at once. This might be seen as a bottleneck, but the trick is to identify the quota limits that normal operations actually require.
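The idea behind a quota check can be sketched as below. The QuotaExceeded exception, the check_quota function and the dictionary-based usage and quota tables are hypothetical. On a real platform, the provider's own quota and policy features would enforce this.

```python
class QuotaExceeded(Exception):
    """Raised when a request would push usage past the configured limit."""


def check_quota(usage: dict[str, int], quotas: dict[str, int],
                resource_type: str, requested: int) -> int:
    """Return the new usage count, or raise QuotaExceeded if the request
    would exceed the limit for resource_type."""
    limit = quotas.get(resource_type, 0)  # types without a quota are denied
    new_total = usage.get(resource_type, 0) + requested
    if new_total > limit:
        raise QuotaExceeded(
            f"{resource_type}: {new_total} requested in total, limit is {limit}"
        )
    return new_total
```

Denying resource types that have no configured quota is a deliberately strict assumption. It forces every new kind of resource to be given a limit before anyone can create it.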
Use the free tools available to check compliance with best practices
Each cloud platform has its own suite of tools, best-practice analyzers, suggestion engines and so on. Use those free tools often, as they are the starting point for identifying immediate and obvious issues.
Financial decisions as part of budgeting – Reservations, Bulk purchasing
Review your usage requirements and forecasts, and reserve or bulk-purchase resources accordingly. This can save a significant amount of money.
Use various license benefit schemes properly
Cloud providers that also sell stand-alone licenses (such as Microsoft and Oracle) offer hybrid licensing models, under which you can take advantage of various licensing benefit schemes.
Going beyond
Need for iterations and continuous reviewing
A one-time review, optimization and automation effort is not going to work as long as the systems keep changing (code, platform, etc.), so continuous review and optimization are required. A lot of organizations go through a few rounds of consultancy only to be back at square one after some time.
Consolidate multiple subscriptions and accounts and, if you use multiple clouds, use tools that combine all bills into a holistic view of expenses
Features such as Azure Lighthouse, many third-party tools, and even some built-in tools help you get a holistic view of all your cloud expenses.
Make cost a shared responsibility across the company so that finance and technology teams work together on the bills
Passing the ball from finance to operations to development to R&D does not help reduce your bills. Getting expert advice and working together with all of your teams does.
Hire someone who is experienced in handling the cloud
This could be a dedicated cloud engineer, a DevOps engineer or maybe a sysadmin, but there needs to be someone with the operational knowledge to manage, monitor and optimize the cloud environment. Experienced and certified experts are expensive in the US and EU, but there are other options, such as hiring a certified and experienced remote administrator.