Creating a New Orchestration
============================

You're working on a new perf & scale test project, and you want it automated and running in the CI? Good! Do you already have your test architecture in mind? And is your toolbox ready? Perfect, so we can start building the orchestration!

Prepare the environment
-----------------------

To create an orchestration, go to ``projects/PROJECT_NAME/testing`` and prepare the following boilerplate code. Mind that ``PROJECT_NAME`` should be compatible with Python package naming (no ``-``) to keep things simple.

Prepare the ``test.py``, ``config.yaml`` and ``command_args.yaml.j2``
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

These files are all that is mandatory to have a configurable orchestration layer.

* ``test.py`` should contain these entrypoints, for interacting with the CI:

::

  import sys
  import logging
  import subprocess

  import fire

  # the project-specific helpers used below (entrypoint, common, env, export,
  # visualize) are provided by the TOPSAIL project libraries


  @entrypoint()
  def prepare_ci():
      """
      Prepares the cluster and the namespace for running the tests
      """
      pass


  @entrypoint()
  def test_ci():
      """
      Runs the test from the CI
      """
      pass


  @entrypoint()
  def cleanup_cluster(mute=False):
      """
      Restores the cluster to its original state
      """
      # _Not_ executed in the OpenShift CI cluster (running on AWS).
      # Only required for running in bare-metal environments.
      common.cleanup_cluster()
      pass


  @entrypoint(ignore_secret_path=True, apply_preset_from_pr_args=False)
  def generate_plots_from_pr_args():
      """
      Generates the visualization reports from the PR arguments
      """
      visualize.download_and_generate_visualizations()
      export.export_artifacts(env.ARTIFACT_DIR, test_step="plot")


  class Entrypoint:
      """
      Commands for launching the CI tests
      """

      def __init__(self):
          self.prepare_ci = prepare_ci
          self.test_ci = test_ci
          self.cleanup_cluster_ci = cleanup_cluster
          self.export_artifacts = export_artifacts  # entrypoint not shown in this excerpt
          self.generate_plots_from_pr_args = generate_plots_from_pr_args


  def main():
      # Print help rather than opening a pager
      fire.core.Display = lambda lines, out: print(*lines, file=out)

      fire.Fire(Entrypoint())


  if __name__ == "__main__":
      try:
          sys.exit(main())
      except subprocess.CalledProcessError as e:
          logging.error(f"Command '{e.cmd}' failed --> {e.returncode}")
          sys.exit(1)
      except KeyboardInterrupt:
          print()  # empty line after ^C
          logging.error("Interrupted.")
          sys.exit(1)

* ``config.yaml`` should contain:

::

  ci_presets:
    # name of the preset to apply, or null if no preset
    name: null
    # list of names of the presets to apply, or a single name, or null if no preset
    names: null

    single:
      clusters.create.type: single

    keep:
      clusters.create.keep: true
      clusters.create.ocp.tags.Project: PSAP/Project/...
      # clusters.create.ocp.tags.TicketId:

    light_cluster:
      clusters.create.ocp.deploy_cluster.target: cluster_light

    light:
      extends: [light_cluster]

    ...
  ...

  secrets:
    dir:
      name: psap-ods-secret
      env_key: PSAP_ODS_SECRET_PATH
    # name of the file containing the properties of the LDAP secrets
    s3_ldap_password_file: s3_ldap.passwords
    keep_cluster_password_file: get_cluster.password
    brew_registry_redhat_io_token_file: brew.registry.redhat.io.token
    opensearch_instances: opensearch.yaml
    aws_credentials: .awscred
    git_credentials: git-credentials

  clusters:
    metal_profiles:
      ...: ...
    create:
      type: single # can be: single, ocp, managed
      keep: false
      name_prefix: fine-tuning-ci

      ocp:
        # list of tags to apply to the machineset when creating the cluster
        tags:
          # TicketId: "..."
          Project: PSAP/Project/...
        deploy_cluster:
          target: cluster

        base_domain: psap.aws.rhperfscale.org
        version: 4.15.9
        region: us-west-2

        control_plane:
          type: m6a.xlarge

        workers:
          type: m6a.2xlarge
          count: 2

    sutest:
      is_metal: false
      lab:
        name: null
      compute:
        dedicated: true
        machineset:
          name: workload-pods
          type: m6i.2xlarge
          count: null
          taint:
            key: only-workload-pods
            value: "yes"
            effect: NoSchedule

    driver:
      is_metal: false
      compute:
        dedicated: true
        machineset:
          name: test-pods
          count: null
          type: m6i.2xlarge
          taint:
            key: only-test-pods
            value: "yes"
            effect: NoSchedule

    cleanup_on_exit: false

  matbench:
    preset: null
    workload: projects....visualizations...
    prom_workload: projects....visualizations....
    config_file: plots.yaml
    download:
      mode: prefer_cache
      url:
      url_file:
      # if true, copy the results downloaded by `matbench download` into the artifacts directory
      save_to_artifacts: false
    # directory to plot. Set by testing/common/visualize.py before launching the visualization
    test_directory: null
    lts:
      generate: true
      horreum:
        test_name: null
      opensearch:
        export:
          enabled: false
          enabled_on_replot: false
          fail_test_on_fail: true
        instance: smoke
        index: ...
        index_prefix: ""
        prom_index_suffix: -prom
      regression_analyses:
        enabled: false
        # if the regression analyses fail, mark the test as failed
        fail_test_on_regression: false

  export_artifacts:
    enabled: false
    bucket: rhoai-cpt-artifacts
    path_prefix: cpt/fine-tuning
    dest: null # will be set by the export code

* ``command_args.yaml.j2`` should start with:

::

  {% set secrets_location = false | or_env(secrets.dir.env_key) %}
  {% if not secrets_location %}
    {{ ("ERROR: secrets_location must be defined (secrets.dir.name=" + secrets.dir.name|string + " or env(secrets.dir.env_key=" + secrets.dir.env_key|string + ")) ") | raise_exception }}
  {% endif %}
  {% set s3_ldap_password_location = secrets_location + "/" + secrets.s3_ldap_password_file %}

  # ---

Copy the ``clusters.sh`` and ``configure.sh``
"""""""""""""""""""""""""""""""""""""""""""""

These files are necessary to be able to create clusters on OpenShift CI (``/test rhoai-e2e``). They shouldn't be modified.

With this, the boilerplate code is in place, and we can start building the test orchestration.

Create ``test_....py`` and ``prepare_....py``
"""""""""""""""""""""""""""""""""""""""""""""

At this step, the development of the test orchestration itself starts, and you "just" have to fill the gaps :)

In the ``prepare_ci`` method, prepare your cluster, according to the configuration.

In the ``test_ci`` method, run your test and collect its artifacts.

In the ``cleanup_cluster_ci`` method, clean up your cluster, so that it can be used again for another test.

Start building your test orchestration
--------------------------------------

Once the boilerplate code is in place, we can start building the test orchestration. TOPSAIL provides some "low-level" helper modules:

::

  from projects.core.library import env, config, run, configure_logging, export

as well as libraries of common orchestration bits:

::

  from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod
  from projects.gpu_operator.library import prepare_gpu_operator
  from projects.matrix_benchmarking.library import visualize

These libraries are illustrated below. They are not formally described at the moment. They come from project code blocks that were noticed to be used identically across projects, so they have been moved to library directories to make them easier to reuse. Sharing code across projects increases the risk of unnoticed bugs when updating the library. With this in mind, the question of code sharing vs. code duplication takes another direction, as extensive testing is not easy in such a rapidly evolving project.
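To make this concrete, here is a minimal sketch of how a project's ``prepare_...py`` and ``test_...py`` typically glue these pieces together. The file names, the ``prepare()``/``test()`` functions and the toolbox command shown are hypothetical; only the helper modules and the configuration key come from the examples above:

::

  # prepare_mytest.py (hypothetical file name)
  import logging

  from projects.core.library import config
  from projects.rhods.library import prepare_rhoai as prepare_rhoai_mod


  def prepare():
      """Prepares the cluster according to the configuration"""

      # read a value from the config.yaml file shown above
      worker_count = config.project.get_config("clusters.create.ocp.workers.count")
      logging.info(f"Preparing the cluster with {worker_count} worker nodes")

      # install the RHOAI pre-build (library module described below);
      # not passing a token_file assumes the cluster already has access to Brew
      prepare_rhoai_mod.install(token_file=None)


  # test_mytest.py (hypothetical file name)
  from projects.core.library import run


  def test():
      """Runs the test and captures the cluster state"""

      # toolbox group/command shown for illustration only
      run.run_toolbox("cluster", "capture_environment")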
Core helper modules
"""""""""""""""""""

The ``run`` module
''''''''''''''''''

* helper functions to run system commands, toolbox commands, and ``from_config`` toolbox commands:

::

  def run(command,
          capture_stdout=False, capture_stderr=False,
          check=True, protect_shell=True,
          cwd=None, stdin_file=None, log_command=True)

This method allows running a command, capturing (or not) its stdout/stderr, checking its return code, changing its working directory, protecting it with the bash safety flags (``set -o errexit;set -o pipefail;set -o nounset;set -o errtrace``), passing a file as stdin, logging (or not) the command, ...

::

  def run_toolbox(group, command,
                  artifact_dir_suffix=None, run_kwargs=None,
                  mute_stdout=None, check=None,
                  **kwargs)

This command allows running a toolbox command. ``group, command, kwargs`` are the CLI toolbox command arguments. ``run_kwargs`` allows passing arguments directly to the ``run`` command described above. ``mute_stdout`` allows muting (capturing) the stdout text. ``check`` allows disabling the exception on error check. ``artifact_dir_suffix`` allows appending a suffix to the toolbox directory name (eg, to distinguish two identical calls in the artifacts).

::

  def run_toolbox_from_config(group, command,
                              prefix=None, suffix=None,
                              show_args=None, extra=None,
                              artifact_dir_suffix=None,
                              mute_stdout=False, check=True,
                              run_kwargs=None)

This command allows running a toolbox command with the ``from_config`` helper (see the description of the ``command_args.yaml.j2`` file). ``prefix`` and ``suffix`` allow distinguishing commands in the ``command_args.yaml.j2`` file. ``extra`` allows passing extra arguments that override what is in the template file. ``show_args`` only displays the arguments that would be passed to ``run_toolbox.py``. These three helpers are illustrated in the sketch at the end of this module's description.

* ``run_and_catch`` is a helper function for chaining multiple functions without swallowing exceptions:

::

  exc = None
  exc = run.run_and_catch(
      exc,
      run.run_toolbox, "kserve", "capture_operators_state", run_kwargs=dict(capture_stdout=True),
  )

  exc = run.run_and_catch(
      exc,
      run.run_toolbox, "cluster", "capture_environment", run_kwargs=dict(capture_stdout=True),
  )

  if exc: raise exc

* helper context to run functions in parallel. If ``exit_on_exception`` is set, the code will exit the process when an exception is caught. Otherwise, it will simply raise it. If ``dedicated_dir`` is set, a dedicated directory, based on the ``name`` parameter, will be created.

::

  class Parallel(object):
      def __init__(self, name, exit_on_exception=True, dedicated_dir=True):

Example:

::

  def prepare():
      with run.Parallel("prepare1") as parallel:
          parallel.delayed(prepare_rhoai)
          parallel.delayed(scale_up_sutest)

      test_settings = config.project.get_config("tests.fine_tuning.test_settings")

      with run.Parallel("prepare2") as parallel:
          parallel.delayed(prepare_gpu)
          parallel.delayed(prepare_namespace, test_settings)

      with run.Parallel("prepare3") as parallel:
          parallel.delayed(preload_image_yyy)
          parallel.delayed(preload_image_xxx)
          parallel.delayed(preload_image_zzz)
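For illustration, here is a sketch of how the ``run``, ``run_toolbox`` and ``run_toolbox_from_config`` helpers are typically called. The ``oc`` command, the toolbox group/command names, the ``sutest`` prefix and the ``scale`` override are examples only, and the first line assumes that ``run.run`` returns the completed-process object when the output is captured:

::

  # run a system command with the bash safety flags, capturing its output
  nodes = run.run("oc get nodes -oname", capture_stdout=True).stdout

  # run a toolbox command directly, with explicit arguments
  run.run_toolbox("cluster", "capture_environment", mute_stdout=True)

  # run a toolbox command with its arguments resolved from command_args.yaml.j2,
  # selecting the "sutest"-prefixed entry and overriding one value with `extra`
  run.run_toolbox_from_config("cluster", "set_scale",
                              prefix="sutest", extra=dict(scale=2))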
The ``env`` module
''''''''''''''''''

* ``ARTIFACT_DIR``: thread-safe access to the storage directory. Prefer using this over ``$ARTIFACT_DIR``, which isn't thread-safe.

* helper context (``env.NextArtifactDir``) to create a dedicated artifact directory.

Being based on OpenShift CI, TOPSAIL relies on the ``ARTIFACT_DIR`` environment variable to store its artifacts. Each toolbox command creates a new directory named ``nnn__group__command``, which keeps the directories ordered and easy to follow. However, when many commands are executed, sometimes in parallel, the number of directories increases and becomes hard to navigate. This context allows creating subdirectories, to group things logically.

Example:

::

  with env.NextArtifactDir("prepare_namespace"):
      set_namespace_annotations()
      download_data_sources(test_settings)

The ``config`` module
'''''''''''''''''''''

* the ``config.project.get_config()`` helper command to access the configuration. It uses the inline JSON format (dotted keys, eg ``tests.fine_tuning.test_settings``). This object holds the main project configuration.

* the ``config.project.set_config(<key>, <value>)`` helper command to update the configuration. Sometimes, it is convenient to store values in the configuration (eg, coming from the command line). Mind that this is not thread-safe (an error is raised if this command is called in a ``run.Parallel`` context). Also mind that this command does not allow creating new configuration fields in the document; only existing fields can be updated.

The ``projects.rhods.library.prepare_rhoai`` library module
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This library helps with the deployment of RHOAI pre-builds on OpenShift.

* ``install_servicemesh()`` installs the ServiceMesh Operator, if it is not already installed in the cluster (this is a dependency of RHOAI).

* ``uninstall_servicemesh(mute=True)`` uninstalls the ServiceMesh Operator, if it is installed.

* ``is_rhoai_installed()`` tells if RHOAI is currently installed or not.

* ``install(token_file=None, force=False)`` installs RHOAI, if it is not already installed (unless ``force`` is passed). Mind that the current deployment code only works with the pre-builds of RHOAI, which require a Brew ``token_file``. If the token isn't passed, it is assumed that the cluster already has access to Brew.

The ``projects.gpu_operator.library.prepare_gpu_operator`` library module
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This library helps with the deployment of the GPU stack on OpenShift.

* ``prepare_gpu_operator()`` deploys the NFD Operator and the GPU Operator, if they are not already installed.

* ``wait_ready(...)`` waits for the GPU Operator stack to be deployed, and optionally enables additional GPU Operator features:

  * ``enable_time_sharing`` enables the time-sharing capability of the GPU Operator (configured via the ``command_args.yaml.j2`` file).
  * ``extend_metrics=True, wait_metrics=True`` enables extra metrics to be captured by the GPU Operator DCGM component (the "well-known" metrics set). If ``wait_metrics`` is enabled, the automation will wait for the DCGM component to start reporting these metrics.
  * ``wait_stack_deployed`` allows disabling the final wait, and only enabling the components above.

* ``cleanup_gpu_operator()`` undeploys the GPU Operator and the NFD Operator, if they are deployed.

* ``add_toleration(effect, key)`` adds a toleration to the GPU Operator DaemonSet Pods. This allows the GPU Operator Pods to be deployed on nodes with specific taints. Mind that this command overrides any toleration previously set. A combined usage sketch follows this list.
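As an illustration, a GPU preparation step could combine these helpers as follows. This is a sketch: the ``prepare_gpu()`` wrapper is hypothetical, and the toleration key simply matches the ``config.yaml`` example above:

::

  from projects.gpu_operator.library import prepare_gpu_operator


  def prepare_gpu():
      # deploy the NFD and GPU Operators, if they are not already installed
      prepare_gpu_operator.prepare_gpu_operator()

      # let the GPU Operator Pods run on the tainted workload nodes
      prepare_gpu_operator.add_toleration(effect="NoSchedule", key="only-workload-pods")

      # wait for the stack, with time-sharing and the extra DCGM metrics enabled
      prepare_gpu_operator.wait_ready(enable_time_sharing=True,
                                      extend_metrics=True, wait_metrics=True)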
The ``projects.local_ci.library.prepare_user_pods`` library module
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This library helps with the execution of multi-user TOPSAIL tests. Multi-user tests consist of Pods running inside the cluster, each executing a TOPSAIL command. Their initialization is synchronized with a barrier, then they wait a configurable delay before starting their script. When they terminate, their file artifacts are collected via an S3 server, and stored locally for post-processing.

* ``prepare_base_image_container(namespace)`` builds a TOPSAIL image in a given namespace. The image must be consistent with the commit of TOPSAIL being tested, so the ``BuildConfig`` relies on the PR number to fetch the right commit. The ``apply_prefer_pr`` function provides the helper code to update the configuration with the number of the PR being tested.

* ``apply_prefer_pr(pr_number=None)`` inspects the environment to detect the PR number. When running locally, export ``HOMELAB_CI=true`` and ``PULL_NUMBER=...`` for this function to automatically detect the PR number. Mind that this function updates the configuration file, so it cannot run inside a parallel context.

* ``delete_istags(namespace)`` cleans up the istags used by the TOPSAIL User Pods.

* ``rebuild_driver_image(namespace, pr_number)`` helps refreshing the image when running locally.

::

  @entrypoint()
  def rebuild_driver_image(pr_number):
      namespace = config.project.get_config("base_image.namespace")
      prepare_user_pods.rebuild_driver_image(namespace, pr_number)

* ``cluster_scale_up(user_count)`` scales up the cluster with the right number of nodes (when not running in a bare-metal cluster).

* ``prepare_user_pods(user_count)`` prepares the cluster for running a multi-user scale test. It deploys the dependency tools (minio, redis), builds the image, prepares the ServiceAccount that TOPSAIL will use, prepares the secrets that TOPSAIL will have access to, ...

* ``cleanup_cluster()`` cleans up the cluster by deleting the User Pods namespace.

The ``projects.matrix_benchmarking.library.visualize`` library module
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

This module helps with the post-processing of TOPSAIL results.

* ``prepare_matbench()`` is called from the ContainerFile. It installs the ``pip`` dependencies of MatrixBenchmarking.

* ``download_and_generate_visualizations(results_dirname)`` is called from the CIs, when replotting. It downloads the test results and runs the post-processing steps against them.

* ``generate_from_dir(results_dirname, generate_lts=None)`` is the main entrypoint of this library. It accepts a directory as argument, and runs the post-processing steps against it (see the sketch below).

The expected configuration should be further documented ...
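As a final illustration, here is a hedged sketch of how this post-processing module might be wired at the end of a test run. The ``post_process()`` wrapper, the use of ``env.ARTIFACT_DIR`` as the results directory and the ``test_step`` value are assumptions for illustration:

::

  from projects.core.library import env, export
  from projects.matrix_benchmarking.library import visualize


  def post_process():
      """Generates the plots, then exports the artifacts"""

      # run the post-processing steps against the artifacts of the current run
      visualize.generate_from_dir(env.ARTIFACT_DIR)

      # export the artifacts (S3 bucket configured in the export_artifacts section)
      export.export_artifacts(env.ARTIFACT_DIR, test_step="test")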