Wednesday, June 5, 2024

ScienceLogic: Managing IT Costs in a Multi-Cloud Era

Organizations with large IT infrastructures, particularly in this multi-cloud era, have difficulty with cost control. Seeing what the costs are, and where they sit in the infrastructure, is one thing; identifying the inefficiencies in their networks is another thing entirely.

Why?

On paper, it sounds reasonable for IT to know what its infrastructure is and what resources are committed to which projects. However, the realities of large corporations make having this information, clearly delineated and at your fingertips, something of a pipe dream.

What are the barriers that keep large corporations from knowing their own IT infrastructure?

Case Study

The US Federal Communications Commission (FCC)

The FCC employs approximately 1,500 people, of whom fewer than 100 make up its Office of Engineering and Technology (OET). On the face of it, this is not a large corporate structure, but that is counterbalanced by the fact that nearly everyone at the FCC works with or supports IT infrastructure, and by a US-wide customer base making requests for radio-frequency (RF) bands that are serviced by tools created and maintained by OET.

Does OET know every project on every network of networks that the FCC maintains and pays for?

Short answer: No. Not even close.

The infrastructure problems the FCC faces mirror those of many corporations with large IT infrastructures: projects sit in various states of development; abandoned projects do not have all their resources accounted for and terminated; projects in production are in various versioning states. Some of these projects were started and are maintained by OET personnel, but some, indeed many, are not: some are contracted out, and some are handled by other offices within the FCC.

Certainly the FCC, like other large organizations, publishes policies and has terms of use for resources created and maintained within its infrastructure. But how do you promulgate and enforce those policies when OET has only so many (few, actually) people, and so much of their work is piling up in the backlog? For example, some requests for RF bands from external organizations have been pending for more than a year.

IT infrastructure teams are buried in work, tasked beyond their capacity, and have almost no time to manage infrastructure costs. After all, how does one even distinguish a resource that is being used productively from one that is simply idling, costing the taxpayers money, day after day, month after month, year after year?

How much are these costs? In one internal audit the FCC conducted, it was found that 1,500 AWS EC2 instances had been idling for months on end. The auditors terminated 750 of those instances with no negative impact on the FCC's IT infrastructure, at a cost savings of $10,000 per month.

The internal audit involved two dedicated people over a period of a month.
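
To make that kind of audit concrete, here is a minimal sketch, in Python with boto3, of the idle-instance triage such auditors would otherwise perform largely by hand. The 2% CPU threshold and 30-day lookback are my assumptions for illustration, not the FCC's criteria, and the script only reports candidates rather than terminating anything: in the actual audit, a human confirmed every termination.

```python
import boto3
from datetime import datetime, timedelta, timezone

IDLE_CPU_PERCENT = 2.0  # assumed threshold, tune per environment
LOOKBACK_DAYS = 30      # assumed window

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def average_cpu(instance_id: str) -> float:
    """Average CPUUtilization over the lookback window, via CloudWatch."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=LOOKBACK_DAYS),
        EndTime=end,
        Period=86400,             # one datapoint per day
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    if not points:
        return 0.0                # no datapoints at all: almost certainly idle
    return sum(p["Average"] for p in points) / len(points)

# Report (never terminate) every running instance that looks idle;
# a human still confirms before anything is shut down.
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            iid = instance["InstanceId"]
            cpu = average_cpu(iid)
            if cpu < IDLE_CPU_PERCENT:
                print(f"{iid}: avg CPU {cpu:.2f}% over {LOOKBACK_DAYS}d -- candidate")
```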

Case Study

Commercial Client Using Google Cloud

A similar scenario occurred when I conducted an internal audit of the Google Cloud Platform footprint of a large commercial client (over 400,000 employees) with multi-cloud operations.

One of their employees and I had a very specific remit: review all Google Cloud resources in development, then identify and terminate all unused resources.

Our team, over a period of three weeks, identified unused resources, got confirmation from management that these resources were indeed unnecessary, and terminated 42 projects comprising networks, cloud functions, and Pub/Sub topics. These projects had been idling for more than a year, and their estimated running costs were $1,400 per month per project.

Simply shutting down these 42 projects (42 × $1,400 ≈ $58,800) saved our client roughly $60,000 per month, or over $700,000 per annum.

Our study, granted, was of very limited scope: it involved two dedicated people and took three weeks to complete.
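
For a sense of what the triage looked like, here is a hedged sketch of a project-level idle check in Python. It is not the tooling we actually used: it shells out to the real gcloud CLI and treats a project with no log entries in the last 90 days as an idle candidate. The 90-day window and the "no logs means idle" heuristic are both assumptions, and, as in the real engagement, nothing gets terminated without management sign-off.

```python
import json
import subprocess

LOOKBACK = "90d"  # assumed idleness window, tune per environment

def gcloud(args: list[str]) -> list[dict]:
    """Run a gcloud command and parse its JSON output."""
    out = subprocess.run(
        ["gcloud", *args, "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

# Every project visible to the caller's credentials.
for project in gcloud(["projects", "list"]):
    pid = project["projectId"]
    try:
        # Any log entry at all in the window counts as activity.
        entries = gcloud([
            "logging", "read",
            "--project", pid,
            "--freshness", LOOKBACK,
            "--limit", "1",
        ])
    except subprocess.CalledProcessError:
        continue  # no permission to read this project's logs; skip, don't guess
    if not entries:
        print(f"{pid}: no log entries in {LOOKBACK} -- idle candidate")
```

A listing like this is only the first pass; most of the three weeks in the real audit went into confirming with the owners that the flagged projects were genuinely dead.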

The Issues with Mandated IT Infrastructure Audits

The above approach, ...

  • studying the network,
  • building a table of the committed resources,
  • then investigating which resources are engaged and which are idle,

... is the traditional approach to network maintenance, but it has several crippling problems:

  • These studies are often conducted only after a problem is noticed, and by that time the serious hemorrhaging of funding has already occurred and continues to occur.
  • These studies often require consensus-building with management and engineers, which usually entails guesswork about which resources are necessary and which can be shut down. There is always a risk that a necessary system will be terminated, throwing IT into a panic to restore services. So, instead of facing that fear, people often overcompensate, leaving moribund resources running ... "because it might be needed, ... somewhere, ... maybe, ... and nobody knows how to start it back up if we lose it, because the guy who maintained it retired 12 years ago."
  • These studies fail to put automation in place that will stop the bleeding. Yes, $10,000 or $60,000 per month were saved, but what is to stop new resources from being spun up that replace, then overtake, those cost savings? Nothing. So new studies, starting from scratch, become necessary, but they are usually never put on a cadence, and only the next cost emergency signals a new study (see the sketch after this list for the obvious fix).
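
That last point, the missing cadence, is the easiest to attack. Below is a minimal sketch of running the audit on a schedule and diffing each run against the last, so newly idle resources surface as they appear rather than at the next emergency. Note that run_idle_scan() is a hypothetical hook standing in for scans like the two sketched above, not a real API.

```python
import json
from pathlib import Path

STATE = Path("idle_candidates.json")  # assumed location of the previous report

def run_idle_scan() -> set[str]:
    """Hypothetical hook: return resource IDs flagged as idle.

    Wire in scans like the EC2 and Google Cloud sketches above.
    """
    raise NotImplementedError

def audit_once() -> None:
    current = run_idle_scan()
    previous = set(json.loads(STATE.read_text())) if STATE.exists() else set()
    # The diff is the point: newly idle resources get flagged the week they
    # appear, not at the next cost emergency.
    for resource in sorted(current - previous):
        print(f"NEW idle candidate since last run: {resource}")
    STATE.write_text(json.dumps(sorted(current)))

if __name__ == "__main__":
    # Meant to be invoked on a schedule, e.g. a weekly cron entry:
    #   0 6 * * MON /usr/bin/python3 cadence_audit.py
    audit_once()
```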

How can we introduce automation, across the board, to assist IT departments in their network maintenance?

ScienceLogic

Network and Device Discovery

Network Load Monitoring

AI Agents / Automated Responses
