Creating a new visualization module
===================================

TOPSAIL post-processing and visualization rely on MatrixBenchmarking modules. The
post-processing steps are configured within the ``matbench`` field of the
configuration file:

::

    matbench:
      preset: null
      workload: projects.fine_tuning.visualizations.fine_tuning
      config_file: plots.yaml
      download:
        mode: prefer_cache
        url:
        url_file:
        # if true, copy the results downloaded by `matbench download` into the artifacts directory
        save_to_artifacts: false
      # directory to plot. Set by testing/common/visualize.py before launching the visualization
      test_directory: null
      lts:
        generate: true
        horreum:
          test_name: null
        opensearch:
          export:
            enabled: false
            enabled_on_replot: false
            fail_test_on_fail: true
          instance: smoke
          index: topsail-fine-tuning
          index_prefix: ""
          prom_index_suffix: -prom
        regression_analyses:
          enabled: false
          # if the regression analyses fail, mark the test as failed
          fail_test_on_regression: false

The visualization modules are split into several sub-modules, which are described
below.

The ``store`` module
--------------------

The ``store`` module is built as an extension of
``projects.matrix_benchmarking.visualizations.helpers.store``, which defines the
``store`` architecture usually used in TOPSAIL.

::

    local_store = helpers_store.BaseStore(
        cache_filename=CACHE_FILENAME, important_files=IMPORTANT_FILES,

        artifact_dirnames=parsers.artifact_dirnames,
        artifact_paths=parsers.artifact_paths,

        parse_always=parsers.parse_always, parse_once=parsers.parse_once,

        # ---

        lts_payload_model=models_lts.Payload,
        generate_lts_payload=lts_parser.generate_lts_payload,

        # ---

        models_kpis=models_kpi.KPIs,
        get_kpi_labels=lts_parser.get_kpi_labels,
    )

The upper part defines the core of the ``store`` module. It is mandatory. The lower
parts define the LTS payload and the KPIs. These parts are optional, and are only
required to push KPIs to OpenSearch.

The store parsers
~~~~~~~~~~~~~~~~~

The goal of the ``store.parsers`` module is to turn TOPSAIL test artifact
directories into a Python object that can be plotted or turned into LTS KPIs.

The parsers of the main workload components rely on the ``simple`` store:

::

    store_simple.register_custom_parse_results(local_store.parse_directory)

The ``simple`` store searches for a ``settings.yaml`` file and an ``exit_code``
file. When these two files are found, the parsing of a test begins, and the current
directory is considered a test root directory.

The parsing is done this way (pseudo-code):

::

    if exists(CACHE_FILE) and not MATBENCH_STORE_IGNORE_CACHE == true:
        results = reload(CACHE_FILE)
    else:
        results = parse_once()

    parse_always(results)

    results.lts = parse_lts(results)

    return results

This organization improves the flexibility of the parsers, with respect to what
takes time (should go in ``parse_once``) vs what depends on the current execution
environment (should go in ``parse_always``).

Mind that if you are working on the parsers, you should disable the cache, or your
modifications will not be taken into account:

::

    export MATBENCH_STORE_IGNORE_CACHE=true

You can re-enable it afterwards with:

::

    unset MATBENCH_STORE_IGNORE_CACHE

The result of the main parser is a ``types.SimpleNamespace`` object. By choice, it
is weakly (on the fly) defined, so the developers must take care to properly
propagate any modification of its structure. We tested having a Pydantic model, but
that turned out to be too cumbersome to maintain. This could be revisited.
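To illustrate how this weakly-defined results object behaves, here is a minimal,
standalone sketch (the field names are hypothetical, not taken from the actual
parsers):

::

    import types

    # hypothetical results object, similar to what the parsers build
    results = types.SimpleNamespace()
    results.test_uuid = "0000-1111-2222"             # fields are created on the fly
    results.finish_reason = types.SimpleNamespace()  # nested namespaces are common
    results.finish_reason.exit_code = 0

    print(results.finish_reason.exit_code)  # attribute-style access
    print(results.__dict__)                 # dict view, for programmatic traversal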
The important part of the parser is triggered by the execution of this method:

::

    def parse_once(results, dirname):
        results.test_config = helpers_store_parsers.parse_test_config(dirname)
        results.test_uuid = helpers_store_parsers.parse_test_uuid(dirname)
        ...

This ``parse_once`` method is in charge of transforming a directory (``dirname``)
into a Python object (``results``). The parsing heavily relies on
``obj = types.SimpleNamespace()`` objects, which behave like dictionaries whose
fields can be accessed as attributes. The inner dictionary can be accessed with
``obj.__dict__`` for programmatic traversal.

The ``parse_once`` method should delegate the parsing to sub-methods, which
typically look like this (safety checks have been removed for readability):

::

    def parse_once(results, dirname):
        ...
        results.finish_reason = _parse_finish_reason(dirname)
        ...

    @helpers_store_parsers.ignore_file_not_found
    def _parse_finish_reason(dirname):
        finish_reason = types.SimpleNamespace()
        finish_reason.exit_code = None

        with open(register_important_file(dirname, artifact_paths.FINE_TUNING_RUN_FINE_TUNING_DIR / "artifacts/pod.json")) as f:
            pod_def = json.load(f)

        # lookup of the terminated container state in pod_def removed for readability
        finish_reason.exit_code = container_terminated_state["exitCode"]

        return finish_reason

Note that:

* for efficiency, JSON parsing should be preferred to YAML parsing, which is much
  slower.
* for grep-ability, the ``results.xxx`` field name should match the variable
  defined in the method (``xxx = types.SimpleNamespace()``).
* the ``ignore_file_not_found`` decorator will catch ``FileNotFoundError``
  exceptions and return ``None`` instead. This makes the code resilient against
  artifacts that were not generated. This happens "often" while performing
  investigations in TOPSAIL, when a test fails in an unexpected way. The
  visualization is expected to perform as well as possible when this happens
  (graceful degradation), so that the rest of the artifacts can be exploited to
  understand what happened and what caused the failure.

The difference between these two methods:

::

    def parse_once(results, dirname):
        ...

    def parse_always(results, dirname, import_settings):
        ...

is that ``parse_once`` is called once, then the result is saved into a cache file
and reloaded from there on the following runs, unless the environment variable
``MATBENCH_STORE_IGNORE_CACHE=y`` is set. The ``parse_always`` method is always
called, even after reloading the cache file. This can be used to parse information
about the environment in which the post-processing is executed.

::

    artifact_dirnames = types.SimpleNamespace()
    artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR = "*__cluster__capture_environment"
    artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR = "*__fine_tuning__run_fine_tuning_job"
    artifact_dirnames.RHODS_CAPTURE_STATE = "*__rhods__capture_state"

    artifact_paths = types.SimpleNamespace() # will be dynamically populated

This block is used to look up the directories where the files to be parsed are
stored (the ``nnn__`` prefix can change easily, so it should not be hardcoded).
During the initialization of the store module, the directories listed in
``artifact_dirnames`` are resolved and stored in the ``artifact_paths`` namespace.
They can be used in the parser with, e.g.,
``artifact_paths.FINE_TUNING_RUN_FINE_TUNING_DIR / "artifacts/pod.log"``. If the
directory glob does not resolve to an existing directory, its value is ``None``.
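As an additional illustration, a parsing sub-method can also bail out early when a
whole capture directory is missing. The helper below is hypothetical (it is not part
of the actual parsers), but it follows the same pattern and reads the
``rhods.version`` file listed in the important files below:

::

    @helpers_store_parsers.ignore_file_not_found
    def _parse_rhods_version(dirname):
        # skip the parsing if the capture directory was not generated at all
        if not artifact_paths.RHODS_CAPTURE_STATE:
            return None

        with open(register_important_file(dirname, artifact_paths.RHODS_CAPTURE_STATE / "rhods.version")) as f:
            return f.read().strip()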
::

    IMPORTANT_FILES = [
        ".uuid",
        "config.yaml",

        f"{artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR}/_ansible.log",
        f"{artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR}/nodes.json",
        f"{artifact_dirnames.CLUSTER_CAPTURE_ENV_DIR}/ocp_version.yml",

        f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/src/config_final.json",
        f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/artifacts/pod.log",
        f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/artifacts/pod.json",
        f"{artifact_dirnames.FINE_TUNING_RUN_FINE_TUNING_DIR}/_ansible.play.yaml",

        f"{artifact_dirnames.RHODS_CAPTURE_STATE}/rhods.createdAt",
        f"{artifact_dirnames.RHODS_CAPTURE_STATE}/rhods.version",
    ]

This block defines the files important for the parsing. They are "important" and
not "mandatory", as the parsing should be able to proceed even if some of them are
missing.

The list of "important files" is used when downloading results for re-processing.
The download command can either look up only the cache file, or download all the
important files. A warning is issued during the parsing if a file opened with
``register_important_file`` is not part of the important files list.

The ``store`` and ``models`` LTS and KPI modules
------------------------------------------------

The Long-Term Storage (LTS) payload and the Key Performance Indicators (KPIs) are
TOPSAIL/MatrixBenchmarking features for Continuous Performance Testing (CPT).

* The LTS payload is a "complex" object, with ``metadata``, ``results`` and
  ``kpis`` fields. The ``metadata`` and ``results`` parts are defined with Pydantic
  models, which enforce their structure. This was the first attempt of
  TOPSAIL/MatrixBenchmarking to move towards long-term stability of the test
  results and metadata. This attempt has not been convincing, but it is still part
  of the pipeline for historical reasons. Any metadata or result can be stored in
  these two objects, provided that the corresponding fields are added to the
  models.

* The KPIs are our current working solution for continuous performance testing. A
  KPI is a simple object, which consists of a value, a help text, a timestamp, a
  unit, and a set of labels. The KPIs follow the OpenMetrics idea:

  ::

    # HELP kserve_container_cpu_usage_max Max CPU usage of the Kserve container | container_cpu_usage_seconds_total
    # UNIT kserve_container_cpu_usage_max cores
    kserve_container_cpu_usage_max{instance_type="g5.2xlarge", accelerator_name="NVIDIA-A10G", ocp_version="4.16.0-rc.6", rhoai_version="2.13.0-rc1+2024-09-02", model_name="flan-t5-small", ...} 1.964734477279039

Currently, the KPIs are part of the LTS payload, and the labels are duplicated for
each of the KPIs. This design will be reconsidered in the near future.

Definition of KPI labels and values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The KPIs are a set of performance indicators and labels.
The KPIs are defined by functions which extract the KPI value by inspecting the LTS
payload:

::

    @matbench_models.HigherBetter
    @matbench_models.KPIMetadata(help="Number of dataset tokens processed per seconds per GPU", unit="tokens/s")
    def dataset_tokens_per_second_per_gpu(lts_payload):
        return lts_payload.results.dataset_tokens_per_second_per_gpu

The name of the function is the name of the KPI, and the decorators define the
metadata and some formatting properties:

::

    # mandatory
    @matbench_models.KPIMetadata(help="Number of train tokens processed per GPU per seconds", unit="tokens/s")

    # one of these two is mandatory
    @matbench_models.LowerBetter
    # or
    @matbench_models.HigherBetter

    # ignore this KPI in the regression analyses
    @matbench_models.IgnoredForRegression

    # simple value formatter
    @matbench_models.Format("{:.2f}")
    # formatter with a divisor (and a new unit)
    @matbench_models.FormatDivisor(1024, unit="GB", format="{:.2f}")

The KPI labels are defined via a Pydantic model:

::

    KPI_SETTINGS_VERSION = "1.0"
    class Settings(matbench_models.ExclusiveModel):
        kpi_settings_version: str

        ocp_version: matbench_models.SemVer
        rhoai_version: matbench_models.SemVer

        instance_type: str
        accelerator_type: str
        accelerator_count: int

        model_name: str
        tuning_method: str
        per_device_train_batch_size: int
        batch_size: int
        max_seq_length: int
        container_image: str

        replicas: int
        accelerators_per_replica: int

        lora_rank: Optional[int]
        lora_dropout: Optional[float]
        lora_alpha: Optional[int]
        lora_modules: Optional[str]

        ci_engine: str
        run_id: str
        test_path: str
        urls: Optional[dict[str, str]]

So eventually, the KPIs are the combination of the generic part
(``matbench_models.KPI``) and the project-specific labels (``Settings``):

::

    class KPI(matbench_models.KPI, Settings): pass
    KPIs = matbench_models.getKPIsModel("KPIs", __name__, kpi.KPIs, KPI)

Definition of the LTS payload
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The LTS payload was the original design of the document to save for continuous
performance testing. The KPIs have replaced it in this role, but in the current
state of the project, the LTS payload still includes the KPIs. The LTS payload is
the object actually sent to the OpenSearch database.

The LTS payload is composed of three objects:

- the metadata (replaced by the KPI labels)
- the results (replaced by the KPI values)
- the KPIs

::

    LTS_SCHEMA_VERSION = "1.0"
    class Metadata(matbench_models.Metadata):
        lts_schema_version: str
        settings: Settings

        presets: List[str]
        config: str
        ocp_version: matbench_models.SemVer

    class Results(matbench_models.ExclusiveModel):
        train_tokens_per_second: float
        dataset_tokens_per_second: float
        gpu_hours_per_million_tokens: float
        dataset_tokens_per_second_per_gpu: float
        train_tokens_per_gpu_per_second: float
        train_samples_per_second: float
        train_runtime: float
        train_steps_per_second: float
        avg_tokens_per_sample: float

    class Payload(matbench_models.ExclusiveModel):
        metadata: Metadata
        results: Results
        kpis: KPIs

Generation of the LTS payload
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The generation of the LTS payload is done after the parsing of the main artifacts:

::

    def generate_lts_payload(results, import_settings):
        lts_payload = types.SimpleNamespace()

        lts_payload.metadata = generate_lts_metadata(results, import_settings)
        lts_payload.results = generate_lts_results(results)
        # lts_payload.kpis is generated in the helper store

        return lts_payload
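For reference, ``generate_lts_results`` typically copies values from the parsed
results into a flat namespace mirroring the ``Results`` model. The sketch below is
hypothetical: the ``results.train_summary`` source field is made up, and only the
target field names come from the model above:

::

    def generate_lts_results(results):
        results_lts = types.SimpleNamespace()

        # hypothetical location of the parsed training metrics
        summary = results.train_summary

        results_lts.train_tokens_per_second = summary.train_tokens_per_second
        results_lts.train_runtime = summary.train_runtime
        results_lts.train_samples_per_second = summary.train_samples_per_second
        # ... remaining fields of the Results model

        return results_lts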
On purpose, the parser does *not* use the Pydantic models when creating the LTS
payload. The reason is that the Pydantic validation is strict: if a field is
missing, the object will not be created and an exception will be raised. When
TOPSAIL is used for running performance investigations (in particular scale tests),
we do not want this, because the test might terminate with some artifacts missing.
In that case the parsing will be incomplete, and we do *not* want that to abort the
visualization process.

However, when running in continuous performance testing mode, we do want to
guarantee that everything is correctly populated. So TOPSAIL runs the parsing
twice. First, without checking the LTS conformity:

::

    matbench parse --output-matrix='.../internal_matrix.json' \
        --pretty='True' \
        --results-dirname='...' \
        --workload='projects.kserve.visualizations.kserve-llm'

Then, when LTS generation is enabled, with the LTS checkup:

::

    matbench parse \
        --output-lts='.../lts_payload.json' \
        --pretty='True' \
        --results-dirname='...' \
        --workload='projects.kserve.visualizations.kserve-llm'

This second step (which reloads from the cache file) will be recorded as a failure
if the parsing is incomplete.

Generation of the KPI values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The KPI values are generated in two steps.

First, the ``KPIs`` dictionary is populated when the ``KPIMetadata`` decorator is
applied to a function (``function name --> dict with the function, metadata, format,
etc.``):

::

    KPIs = {} # populated by the @matbench_models.KPIMetadata decorator

    # ...

    @matbench_models.KPIMetadata(help="Number of train tokens processed per seconds", unit="tokens/s")
    def train_tokens_per_second(lts_payload):
        return lts_payload.results.train_tokens_per_second

Second, when the LTS payload is generated via the ``helpers_store`` module,

::

    import projects.matrix_benchmarking.visualizations.helpers.store as helpers_store

the LTS payload is passed to each KPI function, and the full KPI is generated.

The ``plotting`` visualization module
-------------------------------------

The ``plotting`` module contains two kinds of classes: the "actual" plotting
classes, which generate Plotly plots, and the report classes, which generate HTML
pages, based on Plotly's Dash framework.

The ``plotting`` plot classes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``plotting`` plot classes generate Plotly plots. They receive a set of
parameters about what should be plotted:

::

    def do_plot(self, ordered_vars, settings, setting_lists, variables, cfg):
        ...

and they return a Plotly figure, and optionally some text to write below the plot:

::

    return fig, msg

The parameters are mostly useful when multiple experiments have been captured:

- ``setting_lists`` and ``settings`` should not be touched. They should be passed
  to ``common.Matrix.all_records``, which will return a filtered list of all the
  entries to include in the plot.

  ::

    for entry in common.Matrix.all_records(settings, setting_lists):
        # extract plot data from entry
        pass

  Some plotting classes may be written to display the results of a single
  experiment. A fail-safe exit can be written this way:

  ::

    if common.Matrix.count_records(settings, setting_lists) != 1:
        return {}, "ERROR: only one experiment must be selected"

- the ``variables`` dictionary tells which settings have multiple values. E.g., we
  may have 6 experiments, all with ``model_name=llama3``, but with
  ``virtual_users=[4, 16, 32]`` and ``deployment_type=[raw, knative]``. In this
  case, ``virtual_users`` and ``deployment_type`` will be listed in ``variables``.
  This is useful to give a name to each entry. E.g., here,
  ``entry.get_name(variables)`` may return ``virtual_users=16, deployment_type=raw``.
- the ``ordered_vars`` list tells the preferred ordering for processing the
  experiments. With the example above and
  ``ordered_vars=[virtual_users, deployment_type]``, we may want to use the
  ``virtual_users`` setting as the legend. With
  ``ordered_vars=[deployment_type, virtual_users]``, we may want to use the
  ``deployment_type`` instead. This gives flexibility in the way the plots are
  rendered. This order can be set in the GUI, or via the reporting calls.

  Note that using these parameters is optional. They are not meaningful when only
  one experiment should be plotted, and ``ordered_vars`` is useful only when using
  the GUI, or when generating reports. They help the generic processing of the
  results.

- the ``cfg`` dictionary provides some dynamic configuration flags for performing
  the visualization. They can be passed either via the GUI, or by the report
  classes (e.g., to highlight a particular aspect of the plot).

Guidelines for writing the plotting classes
"""""""""""""""""""""""""""""""""""""""""""

Writing a plotting class is often messy and dirty, with a lot of ``if`` this
``else`` that. With Plotly's initial framework, ``plotly.graph_objs``, it was easy
and tempting to mix the data preparation (traversing the data structures) with the
data visualization (adding elements like lines to the plot), and to do both parts
in the same loops.

Plotly Express (``plotly.express``) introduced a new way to generate the plots,
based on Pandas DataFrames:

::

    df = pd.DataFrame(generateThroughputData(entries, variables, ordered_vars, cfg__model_name))

    fig = px.line(df, hover_data=df.columns,
                  x="throughput", y="tpot_mean", color="model_testname", text="test_name",)

This pattern, where the first phase shapes the data to plot into a DataFrame, and
the second phase turns the DataFrame into a figure, is the preferred way to
organize the code of the plotting classes. A minimal end-to-end sketch is given at
the end of this section.

The ``plotting`` reports
^^^^^^^^^^^^^^^^^^^^^^^^

The report classes are similar to the plotting classes, except that they generate
... reports, instead of plots (!). A report is an HTML document, based on the Dash
framework HTML tags (that is, Python objects):

::

    args = ordered_vars, settings, setting_lists, variables, cfg

    header += [html.H1("Latency per token during the load test")]
    header += Plot_and_Text(f"Latency details", args)
    header += html.Br()
    header += html.Br()

    header += Plot_and_Text(f"Latency distribution", args)
    header += html.Br()
    header += html.Br()

The configuration dictionary, mentioned above, can be used to generate different
flavors of a plot:

::

    header += Plot_and_Text(f"Latency distribution", set_config(dict(box_plot=False, show_text=False), args))

    for entry in common.Matrix.all_records(settings, setting_lists):
        header += [html.H2(entry.get_name(reversed(sorted(set(list(variables.keys()) + ['model_name'])))))]
        header += Plot_and_Text(f"Latency details", set_config(dict(entry=entry), args))

Defining the plots and reports to generate
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When TOPSAIL has successfully run the parsing step, it calls the ``visualization``
component with a predefined list of reports (preferred) and plots (not recommended)
to generate. This list is stored in ``data/plots.yaml``:

::

    visualize:
    - id: llm_test
      generate:
      - "report: Error report"
      - "report: Latency per token"
      - "report: Throughput"
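To close this section, here is a minimal sketch of a plotting class following the
two-phase DataFrame pattern described above. It is illustrative only: the class
registration boilerplate is omitted, the import path of ``common`` is an
assumption, and ``entry.results.train_tokens_per_second`` is a hypothetical parsed
field:

::

    import pandas as pd
    import plotly.express as px

    import matrix_benchmarking.common as common  # assumed import path for `common`


    def generateThroughputData(entries, variables):
        # first phase: shape the data to plot into a list of records
        data = []
        for entry in entries:
            data.append(dict(
                name=entry.get_name(variables),
                # hypothetical field of the parsed results
                tokens_per_second=entry.results.train_tokens_per_second,
            ))
        return data


    class ThroughputPlot:
        # registration boilerplate omitted

        def do_plot(self, ordered_vars, settings, setting_lists, variables, cfg):
            entries = common.Matrix.all_records(settings, setting_lists)

            df = pd.DataFrame(generateThroughputData(entries, variables))
            if df.empty:
                return {}, "No data available"

            # second phase: turn the DataFrame into a figure
            fig = px.bar(df, x="name", y="tokens_per_second")

            return fig, ""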
The ``analyze`` regression analysis module
------------------------------------------

The last part of TOPSAIL/MatrixBenchmarking post-processing is the automated
regression analysis. The workflow required to enable it will be described in the
orchestration section. In the workload module, only a few keys need to be defined:

::

    # the setting (kpi labels) keys against which the historical regression should be performed
    COMPARISON_KEYS = ["rhoai_version"]

The setting keys listed in ``COMPARISON_KEYS`` will be used to distinguish the
entries to consider as "history" for a given test from everything else. In this
example, we compare against historical OpenShift AI versions.

::

    COMPARISON_KEYS = ["rhoai_version", "image_tag"]

Here, we compare against the historical RHOAI version and image tag.

::

    # the setting (kpi labels) keys that should be ignored when searching for historical results
    IGNORED_KEYS = ["runtime_image", "ocp_version"]

Then we define the settings to ignore when searching for historical records. Here,
we ignore the runtime image name and the OpenShift version.

::

    # the setting (kpi labels) keys *preferred* for sorting the entries in the regression report
    SORTING_KEYS = ["model_name", "virtual_users"]

Then, for readability purposes, we define how the entries should be sorted, so that
the tables have a consistent ordering.

::

    IGNORED_ENTRIES = {
        "virtual_users": [4, 8, 32, 128]
    }

Last, we can define some setting values to ignore while traversing the entries that
have been tested.
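To make the role of these keys more concrete, here is a hedged illustration of how
a historical entry may be matched against the current test settings. This is *not*
the actual MatrixBenchmarking implementation, only a sketch of the semantics
described above:

::

    def is_historical_candidate(entry_settings: dict, current_settings: dict) -> bool:
        """Sketch: decide if an entry can serve as 'history' for the current test."""
        for key, value in current_settings.items():
            if key in COMPARISON_KEYS:
                continue  # the axis we compare against, allowed to differ
            if key in IGNORED_KEYS:
                continue  # explicitly ignored when searching for historical results
            if entry_settings.get(key) != value:
                return False  # every other KPI label must match exactly
        return True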