Welcome to AI/ML Brain Food Part 2 – What’s on the menu. This second article follows up on the previous part, AI/ML Brain Food – Part 1: Where to Start?
We will take the foundation from before and layer a question on top: “What tools are available to help me prepare for AI/ML workloads?”
Today, we’re focusing on getting your on-prem private/hybrid cloud ready for AI/ML-type workloads. Most of this logic could be applied to other HPC-type workloads too, and some of it is worth checking out as good practice in general!
Performance, Quality and Control are the name of the game here. Performance and quality are essential for the AI/ML output to be useful. Control allows you to scale effectively, keep costs at bay and make sure the right guardrails are in place from a compliance perspective. Connectivity also becomes very important as you distribute the workload across different clouds and regions. As with any workload, you want it to work in production, not on an isolated island but connected to your existing apps and databases. You still need to secure the workloads, protect them and provide them with resources. The major difference is that the amount of data will explode because, as we remember from the previous post, data is key to AI/ML.
This is where PrepOps comes in.
AI/ML Preppers List – What vitamins can I take?
In Part 1, we talked about Vitamins as the extra POWER we can give our workloads.
This non-exhaustive list will change depending on many things, and I encourage everyone to come up with their own as they do the research. Depending on how far you want to take things, here are some things to remind yourself of before embarking on AI projects.
- GPUs – You will likely need some, many, or even racks of GPUs, depending on your goal. GPUs can process many computations in parallel, dramatically improving performance. A killer vitamin for AI/ML.
- SDS – Software-Defined Storage – You’ll likely want more, and faster, software-defined storage to hold the vast amount of data required to train models. Cloud storage is another option, but if you are keeping the data on-prem, you will want something with an API (software-defined).
- SmartNIC – You might want to offload some of the network traffic and management to SmartNICs, reducing load on the CPUs.
- ASICs – Some workloads require, or benefit from, specialised application-specific integrated circuits, such as Google’s Tensor Processing Units.
Quality and Control
- Data – A high volume of data will inevitably require a lot of storage, and keeping that data high quality requires resilience and structure.
- Connectivity – You need to be able to SECURELY access the data and connect it to other parts of the application. How long does a network change take in your organization? Network and Security automation could be the key to dramatically increasing the agility of the environment.
- Governance – Where is the data going to come from, and what are the relevant regulatory constraints to consider in its use? Guardrails could be the thing that keeps you out of prison.
- Containerisation – You’ll want to be able to quickly innovate and iterate on any project-related apps. A microservices-based architecture is arguably the best current approach for anything highly distributed in nature, and it provides the added benefit of workload portability.
How VMware can help
I have already mentioned the great work VMware is doing to provide the fastest, most reliable virtualization platform for running deep learning models. From that, you can see that VMware is positioning itself to make running AI/ML workloads as easy and reliable as running any other workload.
Beyond that, the VMware ecosystem of solutions can already significantly help to accelerate the time to value of your projects.
Private Cloud (Modernise IT)
If you’re thinking of running AI/ML workloads on premises, then the first step should be to turn your data centre (DC) into a private cloud (if you haven’t already done so).
This means your DC becomes Self Service, Elastic and Metered. If you have a solid virtualization layer but don’t have a true private cloud, it’s important to do this first; otherwise, you will still be spending all of your time running around dealing with discrete infrastructure components. With a private cloud, you can carve up your infrastructure resources (Compute, Network, Storage, Kubernetes) into pools and provide self-service so that developers can access the servers, containers, databases, storage, networks and anything else they might need to develop this workload.
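To make the “pools plus self-service” idea concrete, here is a minimal sketch in Python. It is purely illustrative and is not a VMware API: the `ResourcePool` class, its method names and the team names are all hypothetical. It only shows how a fixed capacity can be handed out on request (self-service), returned when no longer needed (elastic) and tracked per consumer (metered).

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """Hypothetical pool of capacity (e.g. GPUs) carved out of the data centre."""
    name: str
    capacity: int
    allocations: dict = field(default_factory=dict)  # team name -> units held

    def available(self) -> int:
        # Capacity not currently held by any team.
        return self.capacity - sum(self.allocations.values())

    def request(self, team: str, units: int) -> bool:
        """Self-service: grant the request only if the pool has room."""
        if units <= 0 or units > self.available():
            return False
        self.allocations[team] = self.allocations.get(team, 0) + units
        return True

    def release(self, team: str, units: int) -> None:
        """Elastic: hand capacity back so other teams can use it."""
        remaining = max(self.allocations.get(team, 0) - units, 0)
        if remaining:
            self.allocations[team] = remaining
        else:
            self.allocations.pop(team, None)

    def metered_usage(self, team: str) -> int:
        """Metered: report how many units a team currently holds."""
        return self.allocations.get(team, 0)
```

For example, with a pool of 8 GPUs, a team requesting 6 succeeds, a second team requesting 4 is refused until the first team releases some capacity. A real private cloud adds quotas, approvals and chargeback on top, but the bookkeeping follows the same shape.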
VMware Cloud Foundation with vRealize is a hybrid cloud platform for managing VMs and orchestrating containers, built on full-stack hyperconverged infrastructure (HCI) technology. VMware Cloud Foundation enables consistent, secure infrastructure and operations across private and public cloud.
- Operate like a public cloud and speed up service delivery with self-service provisioning, automated performance and responsive capacity management.
- Increase provisioning speed with programmable provisioning. Enforce repeatable and reliable infrastructure with Infrastructure as Code.
- Continuously optimize app performance with AI-driven automated workload optimization.
- Reduce downtime, improve efficiency, gain end-to-end visibility, and manage risk.
AI/ML workloads are typically distributed (remember the Self-Driving car analogy?), so you might find yourself using different components from different clouds. With VMware vRealize Suite, we enable consistent deployment and operations of your apps, infrastructure, and platform services, from the data centre to the cloud to the edge. vRealize Cloud Management helps you accelerate innovation, gain efficiency and improve control while mitigating risk, so that you can spend time focusing on the actual apps, not the cloud platform. vRealize Cloud Management is available both on premises and as SaaS, and comes in several packages to meet the unique requirements of your hybrid cloud.
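The Infrastructure as Code point in the list above can be sketched in a few lines of Python. This is a generic illustration of the declarative pattern, not how vRealize implements it: the function names, resource names and fields are all hypothetical. You describe the state you want, the tool compares it with what exists, and it applies only the differences, so running it twice changes nothing the second time.

```python
def plan_changes(desired: dict, actual: dict) -> list:
    """Compare desired state with actual state and list the actions needed."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_changes(desired: dict, actual: dict) -> dict:
    """Return the new state after carrying out the plan (inputs are not mutated)."""
    new_state = {name: dict(spec) for name, spec in actual.items()}
    for action, name in plan_changes(desired, actual):
        if action == "delete":
            del new_state[name]
        else:  # "create" and "update" both converge on the desired spec
            new_state[name] = dict(desired[name])
    return new_state


# Hypothetical desired state, as it might be kept in version control.
desired = {
    "train-vm": {"cpus": 16, "gpus": 2},
    "data-net": {"cidr": "10.0.0.0/24"},
}
# Hypothetical current state of the environment.
actual = {
    "train-vm": {"cpus": 8, "gpus": 2},
    "old-vm": {"cpus": 2, "gpus": 0},
}

new_state = apply_changes(desired, actual)
```

Because the plan is derived from the difference between the two states, applying it again against `new_state` produces an empty plan. That repeatability is what makes provisioning reliable.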
- Accelerate hybrid and public cloud innovation with consistent operations across clouds.
- Confidently migrate applications with a full view of component dependencies, network requirements, and security posture.
- Choose the most cost-effective deployment environment across private and public clouds and gain visibility across all business-critical apps no matter where they live.
- Consistently deploy applications to any endpoint and ensure ongoing compliance with your security policies.
Self-Driving Data Centre (with vRealize AI)
I’ve mentioned VMware’s own use of AI/ML with vRealize AI to supercharge your data centre a few times in this article. It’s a great example of an AI/ML service, but also a great way to prepare your own data centre for your own high-performance or AI/ML-type workloads. VMware describes vRealize AI as the first artificial intelligence (AI) and machine learning (ML) solution to optimize infrastructure operations. Through data collection and reinforcement learning techniques, vRealize AI Cloud continuously optimizes your configured KPIs while factoring in the dynamic nature of traditional and modern applications.
This infographic gives a brilliant overview of the technology, use cases and benefits.
I’d also encourage everyone to have a look at this blog post from David Pham: Realize the AI / ML Fundamentals of the Self-Driving Datacenter with vRealize AI Cloud.
Thanks for reading this two-part article. Hopefully you found the brain food nourishing!