Creating a New Orchestration
============================

You're working on a new perf & scale test project, and you want it automated and running in the CI? Good! Do you already have your test architecture in mind? And is your toolbox ready? Perfect, so we can start building the orchestration!

Prepare the environment
-----------------------

To create an orchestration, go to ``projects/PROJECT_NAME/testing`` and prepare the following boilerplate code. Mind that ``PROJECT_NAME`` should be compatible with Python package naming (no ``-``) to keep things simple.

Prepare the ``test.py``, ``config.yaml`` and ``command_args.yaml.j2``
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

These files are all that is mandatory to have a configurable orchestration layer.

* ``test.py`` should contain these entrypoints, for interacting with the CI:

::

  import sys
  import logging
  import subprocess

  import fire

  # the project-specific helpers used below (entrypoint, common, env, export,
  # visualize) are provided by the TOPSAIL project libraries


  @entrypoint()
  def prepare_ci():
      """
      Prepares the cluster and the namespace for running the tests
      """
      pass


  @entrypoint()
  def test_ci():
      """
      Runs the test from the CI
      """
      pass


  @entrypoint()
  def cleanup_cluster(mute=False):
      """
      Restores the cluster to its original state
      """
      # _Not_ executed in the OpenShift CI cluster (running on AWS).
      # Only required for running in bare-metal environments.
      common.cleanup_cluster()
      pass


  @entrypoint(ignore_secret_path=True, apply_preset_from_pr_args=False)
  def generate_plots_from_pr_args():
      """
      Generates the visualization reports from the PR arguments
      """
      visualize.download_and_generate_visualizations()
      export.export_artifacts(env.ARTIFACT_DIR, test_step="plot")


  class Entrypoint:
      """
      Commands for launching the CI tests
      """

      def __init__(self):
          self.prepare_ci = prepare_ci
          self.test_ci = test_ci
          self.cleanup_cluster_ci = cleanup_cluster
          self.export_artifacts = export_artifacts  # entrypoint not shown in this excerpt
          self.generate_plots_from_pr_args = generate_plots_from_pr_args


  def main():
      # Print help rather than opening a pager
      fire.core.Display = lambda lines, out: print(*lines, file=out)

      fire.Fire(Entrypoint())


  if __name__ == "__main__":
      try:
          sys.exit(main())
      except subprocess.CalledProcessError as e:
          logging.error(f"Command '{e.cmd}' failed --> {e.returncode}")
          sys.exit(1)
      except KeyboardInterrupt:
          print()  # empty line after ^C
          logging.error("Interrupted.")
          sys.exit(1)

* ``config.yaml`` should contain:

::

  ci_presets:
    # name of the preset to apply, or null if no preset
    name: null
    # list of names of the presets to apply, or a single name, or null if no preset
    names: null

    single:
      clusters.create.type: single

    keep:
      clusters.create.keep: true
      clusters.create.ocp.tags.Project: PSAP/Project/...
      # clusters.create.ocp.tags.TicketId:

    light_cluster:
      clusters.create.ocp.deploy_cluster.target: cluster_light

    light:
      extends: [light_cluster]

    ...
  ...

  secrets:
    dir:
      name: psap-ods-secret
      env_key: PSAP_ODS_SECRET_PATH
    # name of the file containing the properties of the LDAP secrets
    s3_ldap_password_file: s3_ldap.passwords
    keep_cluster_password_file: get_cluster.password
    brew_registry_redhat_io_token_file: brew.registry.redhat.io.token
    opensearch_instances: opensearch.yaml
    aws_credentials: .awscred
    git_credentials: git-credentials

  clusters:
    metal_profiles:
      ...: ...
    create:
      type: single # can be: single, ocp, managed
      keep: false
      name_prefix: fine-tuning-ci

      ocp:
        # list of tags to apply to the machineset when creating the cluster
        tags:
          # TicketId: "..."
          Project: PSAP/Project/...
        deploy_cluster:
          target: cluster

        base_domain: psap.aws.rhperfscale.org
        version: 4.15.9
        region: us-west-2

        control_plane:
          type: m6a.xlarge

        workers:
          type: m6a.2xlarge
          count: 2

    sutest:
      is_metal: false
      lab:
        name: null
      compute:
        dedicated: true
        machineset:
          name: workload-pods
          type: m6i.2xlarge
          count: null
          taint:
            key: only-workload-pods
            value: "yes"
            effect: NoSchedule

    driver:
      is_metal: false
      compute:
        dedicated: true
        machineset:
          name: test-pods
          count: null
          type: m6i.2xlarge
          taint:
            key: only-test-pods
            value: "yes"
            effect: NoSchedule

    cleanup_on_exit: false

  matbench:
    preset: null
    workload: projects....visualizations...
    prom_workload: projects....visualizations....
    config_file: plots.yaml
    download:
      mode: prefer_cache
      url:
      url_file:
      # if true, copy the results downloaded by `matbench download` into the artifacts directory
      save_to_artifacts: false
    # directory to plot. Set by testing/common/visualize.py before launching the visualization
    test_directory: null
    lts:
      generate: true
      horreum:
        test_name: null
      opensearch:
        export:
          enabled: false
          enabled_on_replot: false
          fail_test_on_fail: true
        instance: smoke
        index: ...
        index_prefix: ""
        prom_index_suffix: -prom
      regression_analyses:
        enabled: false
        # if the regression analyses fail, mark the test as failed
        fail_test_on_regression: false

  export_artifacts:
    enabled: false
    bucket: rhoai-cpt-artifacts
    path_prefix: cpt/fine-tuning
    dest: null # will be set by the export code

* ``command_args.yaml.j2`` should start with:

::

  {% set secrets_location = false | or_env(secrets.dir.env_key) %}
  {% if not secrets_location %}
    {{ ("ERROR: secrets_location must be defined (secrets.dir.name=" + secrets.dir.name|string + " or env(secrets.dir.env_key=" + secrets.dir.env_key|string + ")) ") | raise_exception }}
  {% endif %}
  {% set s3_ldap_password_location = secrets_location + "/" + secrets.s3_ldap_password_file %}

  # ---

Copy the ``clusters.sh`` and ``configure.sh``
"""""""""""""""""""""""""""""""""""""""""""""

These files are necessary to be able to create clusters on OpenShift CI (``/test rhoai-e2e``). They shouldn't be modified.

With this, the boilerplate code is in place, and we can start building the test orchestration.

Create ``test_....py`` and ``prepare_....py``
"""""""""""""""""""""""""""""""""""""""""""""

At this step, the development of the test orchestration itself starts, and you "just" have to fill the gaps :)

In the ``prepare_ci`` method, prepare your cluster, according to the configuration.

In the ``test_ci`` method, run your test and collect its artifacts.

In the ``cleanup_cluster_ci`` method, clean up your cluster, so that it can be used again for another test.

Start building your test orchestration
--------------------------------------

Once the boilerplate code is in place, we can start building the test orchestration. TOPSAIL provides some "low-level" helper modules:

::

  from projects.core.library import env, config, run, configure_logging, export

as well as libraries of common orchestration bits:

::

  from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod
  from projects.gpu_operator.library import prepare_gpu_operator
  from projects.matrix_benchmarking.library import visualize

These libraries are illustrated below. They are not formally described at the moment. They come from project code blocks that were noticed to be used identically across projects, so they have been moved to library directories to make them easier to reuse. Sharing code across projects increases the risk of unnoticed bugs when updating the library. With this in mind, the question of code sharing vs. code duplication takes another direction, as extensive testing is not easy in such a rapidly evolving project.
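To make this concrete, here is a minimal sketch of how a project's ``prepare_...py`` and ``test_...py`` typically glue these pieces together. The file names, the ``prepare()``/``test()`` functions and the toolbox command shown are hypothetical; only the helper modules and the configuration key come from the examples above:

::

  # prepare_mytest.py (hypothetical file name)
  import logging

  from projects.core.library import config
  from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod


  def prepare():
      """Prepares the cluster according to the configuration"""

      # read a value from the config.yaml file shown above
      worker_count = config.project.get_config("clusters.create.ocp.workers.count")
      logging.info(f"Preparing the cluster with {worker_count} worker nodes")

      # install the RHOAI pre-build (library module described below);
      # not passing a token_file assumes the cluster already has access to Brew
      prepare_rhoai_mod.install(token_file=None)


  # test_mytest.py (hypothetical file name)
  from projects.core.library import run


  def test():
      """Runs the test and captures the cluster state"""

      # toolbox group/command shown for illustration only
      run.run_toolbox("cluster", "capture_environment")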
Core helper modules
"""""""""""""""""""

The ``run`` module
''''''''''''''''''

* helper functions to run system commands, toolbox commands, and ``from_config`` toolbox commands:

::

  def run(command,
          capture_stdout=False, capture_stderr=False,
          check=True, protect_shell=True,
          cwd=None, stdin_file=None, log_command=True)

This method allows running a command, capturing (or not) its stdout/stderr, checking its return code, changing its working directory, protecting it with the bash safety flags (``set -o errexit;set -o pipefail;set -o nounset;set -o errtrace``), passing a file as stdin, logging (or not) the command, ...

::

  def run_toolbox(group, command,
                  artifact_dir_suffix=None, run_kwargs=None,
                  mute_stdout=None, check=None,
                  **kwargs)

This command allows running a toolbox command. ``group, command, kwargs`` are the CLI toolbox command arguments. ``run_kwargs`` allows passing arguments directly to the ``run`` command described above. ``mute_stdout`` allows muting (capturing) the stdout text. ``check`` allows disabling the exception on error check. ``artifact_dir_suffix`` allows appending a suffix to the toolbox directory name (eg, to distinguish two identical calls in the artifacts).

::

  def run_toolbox_from_config(group, command,
                              prefix=None, suffix=None,
                              show_args=None, extra=None,
                              artifact_dir_suffix=None,
                              mute_stdout=False, check=True,
                              run_kwargs=None)

This command allows running a toolbox command with the ``from_config`` helper (see the description of the ``command_args.yaml.j2`` file). ``prefix`` and ``suffix`` allow distinguishing commands in the ``command_args.yaml.j2`` file. ``extra`` allows passing extra arguments that override what is in the template file. ``show_args`` only displays the arguments that would be passed to ``run_toolbox.py``. These three helpers are illustrated in the sketch at the end of this module's description.

* ``run_and_catch`` is a helper function for chaining multiple functions without swallowing exceptions:

::

  exc = None
  exc = run.run_and_catch(
      exc,
      run.run_toolbox, "kserve", "capture_operators_state", run_kwargs=dict(capture_stdout=True),
  )

  exc = run.run_and_catch(
      exc,
      run.run_toolbox, "cluster", "capture_environment", run_kwargs=dict(capture_stdout=True),
  )

  if exc: raise exc

* helper context to run functions in parallel. If ``exit_on_exception`` is set, the code will exit the process when an exception is caught. Otherwise, it will simply raise it. If ``dedicated_dir`` is set, a dedicated directory, based on the ``name`` parameter, will be created.

::

  class Parallel(object):
      def __init__(self, name, exit_on_exception=True, dedicated_dir=True):

Example:

::

  def prepare():
      with run.Parallel("prepare1") as parallel:
          parallel.delayed(prepare_rhoai)
          parallel.delayed(scale_up_sutest)

      test_settings = config.project.get_config("tests.fine_tuning.test_settings")

      with run.Parallel("prepare2") as parallel:
          parallel.delayed(prepare_gpu)
          parallel.delayed(prepare_namespace, test_settings)

      with run.Parallel("prepare3") as parallel:
          parallel.delayed(preload_image_yyy)
          parallel.delayed(preload_image_xxx)
          parallel.delayed(preload_image_zzz)
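For illustration, here is a sketch of how the ``run``, ``run_toolbox`` and ``run_toolbox_from_config`` helpers are typically called. The ``oc`` command, the toolbox group/command names, the ``sutest`` prefix and the ``scale`` override are examples only, and the first line assumes that ``run.run`` returns the completed-process object when the output is captured:

::

  # run a system command with the bash safety flags, capturing its output
  nodes = run.run("oc get nodes -oname", capture_stdout=True).stdout

  # run a toolbox command directly, with explicit arguments
  run.run_toolbox("cluster", "capture_environment", mute_stdout=True)

  # run a toolbox command with its arguments resolved from command_args.yaml.j2,
  # selecting the "sutest"-prefixed entry and overriding one value with `extra`
  run.run_toolbox_from_config("cluster", "set_scale",
                              prefix="sutest", extra=dict(scale=2))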
The ``env`` module
''''''''''''''''''

* ``ARTIFACT_DIR``: thread-safe access to the storage directory. Prefer using this over ``$ARTIFACT_DIR``, which isn't thread-safe.

* helper context (``env.NextArtifactDir``) to create a dedicated artifact directory.

Being based on OpenShift CI, TOPSAIL relies on the ``ARTIFACT_DIR`` environment variable to store its artifacts. Each toolbox command creates a new directory named ``nnn__group__command``, which keeps the directories ordered and easy to follow. However, when many commands are executed, sometimes in parallel, the number of directories increases and becomes hard to navigate. This context allows creating subdirectories, to group things logically.

Example:

::

  with env.NextArtifactDir("prepare_namespace"):
      set_namespace_annotations()
      download_data_sources(test_settings)

The ``config`` module
'''''''''''''''''''''

* the ``config.project.get_config()`` helper command to access the configuration. It uses the inline JSON format (dotted keys, eg ``tests.fine_tuning.test_settings``). This object holds the main project configuration.

* the ``config.project.set_config(<key>, <value>)`` helper command to update the configuration. Sometimes, it is convenient to store values in the configuration (eg, coming from the command line). Mind that this is not thread-safe (an error is raised if this command is called in a ``run.Parallel`` context). Also mind that this command does not allow creating new configuration fields in the document; only existing fields can be updated.

The ``projects.rhods.library.prepare_rhoai`` library module
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This library helps with the deployment of RHOAI pre-builds on OpenShift.

* ``install_servicemesh()`` installs the ServiceMesh Operator, if it is not already installed in the cluster (this is a dependency of RHOAI).

* ``uninstall_servicemesh(mute=True)`` uninstalls the ServiceMesh Operator, if it is installed.

* ``is_rhoai_installed()`` tells if RHOAI is currently installed or not.

* ``install(token_file=None, force=False)`` installs RHOAI, if it is not already installed (unless ``force`` is passed). Mind that the current deployment code only works with the pre-builds of RHOAI, which require a Brew ``token_file``. If the token isn't passed, it is assumed that the cluster already has access to Brew.

The ``projects.gpu_operator.library.prepare_gpu_operator`` library module
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This library helps with the deployment of the GPU stack on OpenShift.

* ``prepare_gpu_operator()`` deploys the NFD Operator and the GPU Operator, if they are not already installed.

* ``wait_ready(...)`` waits for the GPU Operator stack to be deployed, and optionally enables additional GPU Operator features:

  * ``enable_time_sharing`` enables the time-sharing capability of the GPU Operator (configured via the ``command_args.yaml.j2`` file).
  * ``extend_metrics=True, wait_metrics=True`` enables extra metrics to be captured by the GPU Operator DCGM component (the "well-known" metrics set). If ``wait_metrics`` is enabled, the automation will wait for the DCGM component to start reporting these metrics.
  * ``wait_stack_deployed`` allows disabling the final wait, and only enabling the components above.

* ``cleanup_gpu_operator()`` undeploys the GPU Operator and the NFD Operator, if they are deployed.

* ``add_toleration(effect, key)`` adds a toleration to the GPU Operator DaemonSet Pods. This allows the GPU Operator Pods to be deployed on nodes with specific taints. Mind that this command overrides any toleration previously set. A combined usage sketch follows this list.
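As an illustration, a GPU preparation step could combine these helpers as follows. This is a sketch: the ``prepare_gpu()`` wrapper is hypothetical, and the toleration key simply matches the ``config.yaml`` example above:

::

  from projects.gpu_operator.library import prepare_gpu_operator


  def prepare_gpu():
      # deploy the NFD and GPU Operators, if they are not already installed
      prepare_gpu_operator.prepare_gpu_operator()

      # let the GPU Operator Pods run on the tainted workload nodes
      prepare_gpu_operator.add_toleration(effect="NoSchedule", key="only-workload-pods")

      # wait for the stack, with time-sharing and the extra DCGM metrics enabled
      prepare_gpu_operator.wait_ready(enable_time_sharing=True,
                                      extend_metrics=True, wait_metrics=True)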
The ``projects.local_ci.library.prepare_user_pods`` library module
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This library helps with the execution of multi-user TOPSAIL tests. Multi-user tests consist of Pods running inside the cluster, each executing a TOPSAIL command. Their initialization is synchronized with a barrier, then they wait a configurable delay before starting their script. When they terminate, their file artifacts are collected via an S3 server, and stored locally for post-processing.

* ``prepare_base_image_container(namespace)`` builds a TOPSAIL image in a given namespace. The image must be consistent with the commit of TOPSAIL being tested, so the ``BuildConfig`` relies on the PR number to fetch the right commit. The ``apply_prefer_pr`` function provides the helper code to update the configuration with the number of the PR being tested.

* ``apply_prefer_pr(pr_number=None)`` inspects the environment to detect the PR number. When running locally, export ``HOMELAB_CI=true`` and ``PULL_NUMBER=...`` for this function to automatically detect the PR number. Mind that this function updates the configuration file, so it cannot run inside a parallel context.

* ``delete_istags(namespace)`` cleans up the istags used by the TOPSAIL User Pods.

* ``rebuild_driver_image(namespace, pr_number)`` helps refreshing the image when running locally.

::

  @entrypoint()
  def rebuild_driver_image(pr_number):
      namespace = config.project.get_config("base_image.namespace")
      prepare_user_pods.rebuild_driver_image(namespace, pr_number)

* ``cluster_scale_up(user_count)`` scales up the cluster with the right number of nodes (when not running in a bare-metal cluster).

* ``prepare_user_pods(user_count)`` prepares the cluster for running a multi-user scale test. It deploys the dependency tools (minio, redis), builds the image, prepares the ServiceAccount that TOPSAIL will use, prepares the secrets that TOPSAIL will have access to, ...

* ``cleanup_cluster()`` cleans up the cluster by deleting the User Pods namespace.

The ``projects.matrix_benchmarking.library.visualize`` library module
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This module helps with the post-processing of TOPSAIL results.

* ``prepare_matbench()`` is called from the ContainerFile. It installs the ``pip`` dependencies of MatrixBenchmarking.

* ``download_and_generate_visualizations(results_dirname)`` is called from the CIs, when replotting. It downloads the test results and runs the post-processing steps against them.

* ``generate_from_dir(results_dirname, generate_lts=None)`` is the main entrypoint of this library. It accepts a directory as argument, and runs the post-processing steps against it (see the sketch below).

The expected configuration should be further documented ...
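As a final illustration, here is a hedged sketch of how this post-processing module might be wired at the end of a test run. The ``post_process()`` wrapper, the use of ``env.ARTIFACT_DIR`` as the results directory and the ``test_step`` value are assumptions for illustration:

::

  from projects.core.library import env, export
  from projects.matrix_benchmarking.library import visualize


  def post_process():
      """Generates the plots, then exports the artifacts"""

      # run the post-processing steps against the artifacts of the current run
      visualize.generate_from_dir(env.ARTIFACT_DIR)

      # export the artifacts (S3 bucket configured in the export_artifacts section)
      export.export_artifacts(env.ARTIFACT_DIR, test_step="test")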