Running GPU-based container applications with Amazon ECS Anywhere
Date: 10/10/2021 Categories: amazonwebservices, AWS

Over the past decade, tens of thousands of customers have migrated their on-premises workloads to the cloud. However, we’ve also seen a number of workloads that cannot simply move to the cloud. Those workloads need to remain on premises due to data residency, network latency, regulatory, or compliance considerations.

Back in May 2021, Amazon Elastic Container Service (Amazon ECS) announced the general availability of Amazon ECS Anywhere (ECS Anywhere) to address these use cases, giving customers a simplified way to run and manage containerized applications on premises. ECS Anywhere added a new “EXTERNAL” launch type to Amazon ECS in addition to the existing EC2 and FARGATE launch types. With this capability, customers can run containers on their own compute hardware using the Amazon ECS APIs in the AWS Region, without running and operating their own container orchestrators.

GPU-based workloads now enabled with ECS Anywhere

While Amazon ECS and ECS Anywhere enable customers to easily leverage their own hardware to solve problems using containers across their hybrid footprint, customers still had to find a way other than ECS Anywhere to run GPU-based container workloads in their data centers, until today. Amazon ECS now supports GPU-based container workloads with ECS Anywhere, so customers can run those workloads with the same experience they have with Amazon ECS in the AWS Region.

GPUs today are widely used in many areas, not only for machine learning but also for 3D visualization, image processing, and big data workloads. With Amazon ECS Anywhere GPU support, customers can now run and manage those workloads using containers in their own data centers, without transferring data to the cloud and without operating their own container orchestrators. This is also great news for customers who have made significant investments in on-premises GPUs, because they can now use ECS Anywhere to put that existing investment to work while removing the operational overhead they have today with toolsets like Docker Swarm or Kubernetes.

Walkthrough: ECS Anywhere with GPU support

Let’s briefly walk through the new ECS Anywhere capability step by step. We’re first going to 1) obtain a registration command, then 2) register a machine with a GPU device to an existing Amazon ECS cluster. Next we will 3) register a simple Amazon ECS task definition, and finally 4) run an Amazon ECS task on the external machine through the Amazon ECS APIs.

In the following steps I’m going to use my empty Amazon ECS cluster named “ecsAnywhere-gpu”, but of course you can use any existing ECS cluster to follow the steps.

1. Obtaining a registration command for an external instance

At this moment there are no registered ECS instances in this Amazon ECS cluster, as you can see in the screenshot below. To register external instances to the Amazon ECS cluster, first select Register External Instances. Note that ECS Anywhere is supported today in version 1 of the Amazon ECS console, so make sure the New ECS Experience checkbox at the top left of the console is turned off.

Registering external instance to ECS cluster

You will now see a dialog window similar to the following figure. The underlined Instance role “ecsExternalInstanceRole” is an IAM role for external instances that I created beforehand by following the steps in the ECS Anywhere documentation. Make sure you have created one before proceeding to the next step.

I’m going to use the default values for Activation key duration and Number of instances here, but you can change them based on your needs. See the details in the ECS Anywhere documentation.

Select Next step to get a registration command in the next view. You will execute this command on your external machine to register it to the Amazon ECS cluster.

The first step to register an external instance
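If you would rather script this step than click through the console, the activation behind the registration command can presumably be created with the AWS CLI as well. The following is a minimal sketch, assuming the AWS CLI is installed and configured, and assuming that the console’s Number of instances setting corresponds to the `--registration-limit` option of `aws ssm create-activation`. The helper name `create_activation` is hypothetical and not part of any AWS tooling.

```shell
# Hypothetical helper (not part of the console flow): create an SSM hybrid
# activation for ECS Anywhere and print its activation ID and activation code.
# Assumes the AWS CLI is installed and configured with suitable permissions.
create_activation() {
  role="${1:?usage: create_activation ROLE [LIMIT] [REGION]}"
  limit="${2:-1}"            # how many machines may register with this activation
  region="${3:-us-east-1}"   # Region of the Amazon ECS cluster
  aws ssm create-activation \
    --iam-role "$role" \
    --registration-limit "$limit" \
    --region "$region" \
    --query '[ActivationId, ActivationCode]' \
    --output text
}

# Usage: create_activation ecsExternalInstanceRole 1 us-east-1
```

The two printed values are what the registration command expects for --activation-id and --activation-code.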

Copy the registration command shown in the dialog window, paste it into your text editor, and add the --enable-gpu flag at the end of the command as follows. Before proceeding to the next step, you may want to close the dialog window by selecting Done in the ECS management console.

# NOTE: I added a few line breaks to make this easier to read
curl --proto "https" \
  -o "/tmp/ecs-anywhere-install.sh" "https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh" \
  && bash /tmp/ecs-anywhere-install.sh \
    --region "us-east-1" \
    --cluster "ecsAnywhere-gpu" \
    --activation-id "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx" \
    --activation-code "XXXXxxxxXXXXxxxxXXXX" \
    --enable-gpu # ADD THIS!

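If you register several machines, editing the command by hand each time can be error-prone. As a rough sketch, the helper below assembles the same command from its four values and always appends --enable-gpu. The function name `print_gpu_registration_command` is hypothetical; the URL and flags are the ones shown in the command above.

```shell
# Hypothetical helper: print the ECS Anywhere registration command with
# --enable-gpu appended. It only prints the command; run the printed
# command as root on the external machine.
print_gpu_registration_command() {
  region="${1:?region required}"
  cluster="${2:?cluster name required}"
  activation_id="${3:?activation ID required}"
  activation_code="${4:?activation code required}"
  script_url="https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh"
  printf 'curl --proto "https" -o "/tmp/ecs-anywhere-install.sh" "%s" && bash /tmp/ecs-anywhere-install.sh --region "%s" --cluster "%s" --activation-id "%s" --activation-code "%s" --enable-gpu\n' \
    "$script_url" "$region" "$cluster" "$activation_id" "$activation_code"
}

# Usage: print_gpu_registration_command us-east-1 ecsAnywhere-gpu <activation ID> <activation code>
```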

2. Execute the registration command in the external machine

Once you are in a shell on your external machine, execute the registration command you edited in the previous step as root. It automatically sets up and registers your machine, and the machine then shows up as an external ECS instance in the ECS management console.

After a few minutes, you should see “You can check your ECS cluster here <ECS Console URL>” in the command output, as shown below. If the command fails, check the network requirements and the supported operating systems in the ECS Anywhere documentation. Also note that Amazon ECS and ECS Anywhere today support the NVIDIA kernel drivers and Docker GPU runtimes for scheduling Amazon ECS tasks. To install and configure those required components on your external machine, please read the guide from NVIDIA (this one is for Tesla, for example).

Registration command output
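Before running the registration command, it can help to confirm the two prerequisites mentioned above: a root shell and an installed NVIDIA driver. The following is a minimal sketch with hypothetical helper names (`require_cmd`, `gpu_preflight`). It treats the presence of nvidia-smi, which ships with the NVIDIA driver, as a rough proxy for the driver being installed, and it does not verify the Docker GPU runtime.

```shell
# Hypothetical preflight helper: report whether a command is on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1 found"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Hypothetical wrapper: the registration command has to run as root, and
# nvidia-smi being available suggests the NVIDIA driver is installed.
gpu_preflight() {
  [ "$(id -u)" -eq 0 ] || { echo "not running as root" >&2; return 1; }
  require_cmd nvidia-smi
}

# Usage: gpu_preflight && echo "ready to run the registration command"
```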

Let’s open the URL to see what’s happening on your Amazon ECS cluster now.

External instance joined

Boom! Now Amazon ECS is aware of your external GPU machine as capacity for the Amazon ECS cluster to run your Amazon ECS tasks.
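The same check can be done from a terminal. The sketch below assumes the AWS CLI is installed and configured; the helper names are hypothetical. The second helper prints the resources that the instance registered with Amazon ECS, where I would expect a GPU entry to appear, although that expectation is an assumption you should confirm on your own instance.

```shell
# Hypothetical helper: list the ARNs of active container instances in a cluster.
list_container_instances() {
  cluster="${1:?cluster name required}"
  aws ecs list-container-instances \
    --cluster "$cluster" \
    --status ACTIVE \
    --query 'containerInstanceArns' \
    --output text
}

# Hypothetical helper: print the resources one container instance registered.
show_registered_resources() {
  cluster="${1:?cluster name required}"
  instance_arn="${2:?container instance ARN required}"
  aws ecs describe-container-instances \
    --cluster "$cluster" \
    --container-instances "$instance_arn" \
    --query 'containerInstances[0].registeredResources'
}

# Usage:
#   list_container_instances ecsAnywhere-gpu
#   show_registered_resources ecsAnywhere-gpu <container instance ARN>
```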

3. Register a sample Amazon ECS task definition

The final step before running an Amazon ECS task is to register the sample Amazon ECS task definition below. As you can see, the Amazon ECS task you’re going to deploy uses the nvidia/cuda container image and runs the well-known nvidia-smi command, which simply prints some information about the GPU devices in your machine and then exits immediately. This confirms that the task is using a GPU device in your external machine.

{
  "containerDefinitions": [
    {
      "memory": 200,
      "essential": true,
      "name": "cuda",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [{
        "type":"GPU",
        "value": "1"
      }],
      "command": [
        "sh", "-c", "nvidia-smi"
      ],
      "cpu": 100
    }
  ],
  "family": "example-ecs-anywhere-gpu"
}


Open the task definition window in the ECS management console, then select Create new Task Definition to register the above JSON.

Selecting "Create new Task Definition"

Choose EXTERNAL, then select Next step.

Selecting launch type for task definition

Scroll down to the bottom of the next window and select Configure via JSON. You’ll see a dialog window where you can configure this task definition via JSON directly.

Selecting "Configure via JSON"

Paste the sample Amazon ECS task definition JSON described earlier in this section, then select Save at the bottom right.

Configuring via JSON

Then select Create to register your Amazon ECS task definition. You’ll see a message like “Created Task Definition successfully” at the top of the window.

Register task definition
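The console steps above can also be approximated with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured and that you saved the sample JSON to a local file. The helper name `register_task_definition` and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: register a task definition from a local JSON file
# for the EXTERNAL launch type and print the resulting task definition ARN.
register_task_definition() {
  json_file="${1:?path to task definition JSON required}"
  aws ecs register-task-definition \
    --cli-input-json "file://$json_file" \
    --requires-compatibilities EXTERNAL \
    --query 'taskDefinition.taskDefinitionArn' \
    --output text
}

# Usage: register_task_definition task-definition.json
```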

4. Run a GPU-based Amazon ECS task in the external ECS instance

You’re now ready to run your first GPU-based Amazon ECS task on the external ECS instance. Select Clusters in the sidebar of the ECS management console, then select your Amazon ECS cluster in the cluster list.

Open ECS cluster

In the cluster view you opened, select the Tasks tab and then Run new task.

Run new task

In the next window, make sure EXTERNAL is selected along with the task definition “example-ecs-anywhere-gpu” you registered in the previous step, then select Run Task at the bottom right of the window.

Run task with specified task definition

You will be redirected to the Amazon ECS cluster window, where you should find one Amazon ECS task scheduled onto the external ECS instance, as shown in the screenshot below.

Task successfully scheduled
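Because ECS Anywhere is driven through the regular Amazon ECS APIs, the same task can be started from the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the helper name `run_gpu_task` is hypothetical, and the cluster and task definition names in the usage line are the ones used in this walkthrough.

```shell
# Hypothetical helper: run one task with the EXTERNAL launch type and
# print the ARN of the task that Amazon ECS started.
run_gpu_task() {
  cluster="${1:?cluster name required}"
  task_definition="${2:?task definition family or ARN required}"
  aws ecs run-task \
    --cluster "$cluster" \
    --launch-type EXTERNAL \
    --task-definition "$task_definition" \
    --query 'tasks[0].taskArn' \
    --output text
}

# Usage: run_gpu_task ecsAnywhere-gpu example-ecs-anywhere-gpu
```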

Let’s select the Task ID (d2e6e7… in this case) highlighted in the above screenshot to see the Amazon ECS task details. The task may still show PENDING when you open the next window, but it will soon be updated to RUNNING and then STOPPED. You should also find that the container successfully ran the nvidia-smi command and exited with Exit Code “0”, as shown below.

Task successfully ran and stopped
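If you are scripting the run, you can wait for the task to stop and then read the container’s status and exit code from the AWS CLI instead of refreshing the console. This is a rough sketch, assuming the AWS CLI is installed and configured and that the task has a single container, as in the sample task definition. The helper name `show_task_result` is hypothetical.

```shell
# Hypothetical helper: block until the task has stopped, then print the
# first container's name, last status, and exit code.
show_task_result() {
  cluster="${1:?cluster name required}"
  task="${2:?task ID or ARN required}"
  aws ecs wait tasks-stopped --cluster "$cluster" --tasks "$task" &&
    aws ecs describe-tasks \
      --cluster "$cluster" \
      --tasks "$task" \
      --query 'tasks[0].containers[0].[name, lastStatus, exitCode]' \
      --output text
}

# Usage: show_task_result ecsAnywhere-gpu <task ID>
```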

If you want to see the actual output from the nvidia-smi command, execute docker ps -a on the external instance to find the container ID, and then execute docker logs <container ID> to see the output as container logs, as shown below. For production workloads, you may also want to configure an Amazon ECS task execution IAM role and the awslogs log driver for your Amazon ECS task, so that you can collect and view your container logs in the AWS Region instead of connecting to the external instances over SSH.

nvidia-smi command output
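The two Docker commands can be combined into one small helper. The sketch below is an assumption-laden convenience, not part of ECS Anywhere: the name `show_latest_container_logs` is hypothetical, it defaults to the image used in the sample task definition, and it relies on docker ps listing the most recently created container first.

```shell
# Hypothetical helper: print the logs of the first container that docker ps
# lists for the given image (expected to be the most recently created one).
show_latest_container_logs() {
  image="${1:-nvidia/cuda:11.0-base}"
  container_id="$(docker ps -a --filter "ancestor=$image" --format '{{.ID}}' | head -n 1)"
  if [ -z "$container_id" ]; then
    echo "no container found for image $image" >&2
    return 1
  fi
  docker logs "$container_id"
}

# Usage (on the external instance): show_latest_container_logs
```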