Configure and Run Workflow for AWS Batch

You will need three values from the Setup and Configure AWS Batch guide to configure the workflow:

  1. The path to the S3 bucket that was set up. This will be referred to as BUCKET_PATH below.

  2. The logs group that was set up. This will be referred to as LOGS_GROUP below.

  3. The job queue name. This will be referred to as JOB_QUEUE below.
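
Before editing the configuration, it can help to sanity-check the three values. The sketch below is a hypothetical helper, not part of the TEI-REX workflows: the function name check_aws_values is made up, and it assumes the three values are held in shell variables named after the placeholders used in this guide. Because the configuration examples later in this guide write paths as s3://BUCKET_PATH/..., it also assumes BUCKET_PATH itself carries no s3:// prefix and no trailing slash.

```shell
# Hypothetical helper: check that the three values from the setup guide are
# present and that BUCKET_PATH has the shape the config examples expect.
check_aws_values() {
    rc=0
    [ -n "${BUCKET_PATH:-}" ] || { echo "BUCKET_PATH is not set" >&2; rc=1; }
    [ -n "${LOGS_GROUP:-}" ]  || { echo "LOGS_GROUP is not set" >&2; rc=1; }
    [ -n "${JOB_QUEUE:-}" ]   || { echo "JOB_QUEUE is not set" >&2; rc=1; }

    # The config writes 's3://BUCKET_PATH/mzml_cache', so the value should not
    # repeat the scheme or end with a slash (assumption based on that example).
    case "${BUCKET_PATH:-}" in
        s3://*|*/)
            echo "BUCKET_PATH should not start with s3:// or end with /" >&2
            rc=1
            ;;
    esac
    return "$rc"
}
```

After setting the three variables in your shell, running check_aws_values returns a non-zero status and prints a message for each problem it finds. It only checks the form of the values, not whether the resources exist on AWS.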

Follow the directions below to configure and run your workflow using AWS Batch.

Configure Workflow

The documentation for each TEI-REX workflow (see: https://mriffle.github.io/teirex-workflows/) describes the pipeline.config file for that workflow. Regardless of which workflow you are running, pipeline.config needs the same modifications to run the workflow on AWS. We recommend making these modifications once and reusing the same pipeline.config for that workflow, adjusting it as needed for the specific data being searched.

First, add the following block to the end of your pipeline.config file. Substitute your own value for LOGS_GROUP (described above). If you configured AWS Batch in a region other than us-west-2, update the value of region to match.

aws {

    batch {
        // NOTE: cliPath is only required if the AWS CLI tool is installed in a custom AMI
        cliPath = '/usr/local/aws-cli/v2/current/bin/aws'
        logsGroup = 'LOGS_GROUP'
        maxConnections = 20
        connectionTimeout = 10000
        uploadStorageClass = 'INTELLIGENT_TIERING'
        storageEncryption = 'AES256'
        retryMode = 'standard'
    }

    region = 'us-west-2'
}

Next, update the profiles section of pipeline.config to include an aws section. Before adding the aws section, it will look similar to:

profiles {

    // "standard" is the profile used when the steps of the workflow are run
    // locally on your computer. These parameters should be changed to match
    // your system resources (that you are willing to devote to running
    // workflow jobs).
    standard {
        params.max_memory = '8.GB'
        params.max_cpus = 4
        params.max_time = '240.h'

        params.mzml_cache_directory = '/data/mass_spec/nextflow/nf-teirex-dda/mzml_cache'
        params.panorama_cache_directory = '/data/mass_spec/nextflow/panorama/raw_cache'
    }
}

After adding the aws section, it will look similar to:

profiles {

    // "standard" is the profile used when the steps of the workflow are run
    // locally on your computer. These parameters should be changed to match
    // your system resources (that you are willing to devote to running
    // workflow jobs).
    standard {
        params.max_memory = '8.GB'
        params.max_cpus = 4
        params.max_time = '240.h'

        params.mzml_cache_directory = '/data/mass_spec/nextflow/nf-teirex-dda/mzml_cache'
        params.panorama_cache_directory = '/data/mass_spec/nextflow/panorama/raw_cache'
    }

    // "aws" profile -- parameters used for running jobs on AWS Batch
    aws {
        process.executor = 'awsbatch'
        process.queue = 'JOB_QUEUE'

        // params for running pipeline on aws batch
        // These can be overridden in local config file

        // max params allowed for your AWS Batch compute environment
        params.max_memory = '124.GB'
        params.max_cpus = 32
        params.max_time = '240.h'

        // where to cache mzml files after running msconvert
        params.mzml_cache_directory = 's3://BUCKET_PATH/mzml_cache'
        params.panorama_cache_directory = 's3://BUCKET_PATH/panorama_cache'
    }
}

Replace JOB_QUEUE and BUCKET_PATH with the values described above. A description of each parameter is below:

  • process.executor - This tells Nextflow which executor to use when the aws profile is selected. Do not change.

  • process.queue - This is the job queue set up for AWS Batch (see: Setup and Configure AWS Batch guide).

  • params.max_memory - This is the maximum amount of memory a task may use in the workflow. This should not exceed the maximum memory available in the hardware profiles for the AWS Batch compute environment.

  • params.max_cpus - This is the maximum number of cores a task may use in the workflow. This should not exceed the maximum cores available in the hardware profiles for the AWS Batch compute environment.

  • params.max_time - This is the maximum amount of time (in hours) a task may be run before being automatically killed and retried.

  • params.mzml_cache_directory - This is the path to the cache of mzML files created by processing raw files.

  • params.panorama_cache_directory - This is the path to the cache of raw files downloaded from PanoramaWeb. If you are not using PanoramaWeb, a value is still required but is never used.
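
The placeholder substitutions described above (LOGS_GROUP, JOB_QUEUE and BUCKET_PATH) can also be scripted rather than edited by hand. The following is a minimal sketch, assuming the placeholders appear in pipeline.config exactly as written in this guide; the function name fill_placeholders and the example values in the comments are hypothetical.

```shell
# Hypothetical helper: print a copy of a config file with the three
# placeholders from this guide replaced by real values.
# usage: fill_placeholders CONFIG_FILE BUCKET_PATH LOGS_GROUP JOB_QUEUE
fill_placeholders() {
    config=$1 bucket=$2 logs=$3 queue=$4
    # '|' is the sed delimiter because the values usually contain '/'.
    # Values containing '|', '&' or backslashes would need escaping first.
    # Matching the surrounding quotes avoids touching comments that merely
    # mention a placeholder name.
    sed -e "s|'LOGS_GROUP'|'$logs'|" \
        -e "s|'JOB_QUEUE'|'$queue'|" \
        -e "s|s3://BUCKET_PATH/|s3://$bucket/|" \
        "$config"
}

# Example (made-up values), writing the result to a new file:
#   fill_placeholders pipeline.config my-bucket/teirex /aws/batch/job teirex-queue > pipeline.aws.config
```

The function writes to standard output and leaves the original file untouched, so the result can be reviewed before it replaces pipeline.config.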

Run Workflow

Once the AWS setup and configuration are complete, run the workflow on AWS by specifying aws as the profile and the S3 work directory on the command line. For example:

  • For the DDA workflow:

nextflow run -resume -r main -profile aws mriffle/nf-teirex-dda -bucket-dir s3://BUCKET_PATH/work -c pipeline.config

  • For the DIA workflow:

nextflow run -resume -r main -profile aws mriffle/nf-teirex-dia -bucket-dir s3://BUCKET_PATH/work -c pipeline.config

Note

Replace BUCKET_PATH with the S3 bucket you set up in the Setup and Configure AWS Batch guide.
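
Since the two commands above differ only in the workflow name, they can be generated from one small function. This is a sketch under stated assumptions: teirex_run_cmd is a hypothetical name, and the function only prints the command shown in this guide with the bucket path filled in; it does not run Nextflow itself.

```shell
# Hypothetical helper: print the nextflow command from this guide for a
# given workflow (dda or dia) and bucket path.
# usage: teirex_run_cmd dda|dia BUCKET_PATH [CONFIG_FILE]
teirex_run_cmd() {
    workflow=$1 bucket=$2 config=${3:-pipeline.config}
    case "$workflow" in
        dda|dia) ;;
        *)
            echo "unknown workflow: $workflow (expected dda or dia)" >&2
            return 1
            ;;
    esac
    # Paths containing spaces are not handled; quote them yourself if needed.
    echo "nextflow run -resume -r main -profile aws mriffle/nf-teirex-$workflow -bucket-dir s3://$bucket/work -c $config"
}
```

Printing the command rather than executing it lets you confirm the bucket path and config file before launching jobs on AWS Batch.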