Automated monitoring checks
Last modified on 27-Sep-23
Use automated monitoring checks to instruct Soda to automatically check for row count anomalies and schema changes in a dataset.
automated monitoring:
  datasets:
    - include %
    - exclude test%
About automated monitoring checks
Prerequisites
Define an automated monitoring check
Optional check configurations
Go further
About automated monitoring checks
When you add automated monitoring checks to your checks.yml file, Soda Library prepares and executes two checks on all the datasets you indicate as included in your checks YAML file.
Anomaly score check on row count: This check counts the number of rows in a dataset during scan and registers anomalous counts relative to previous measurements for the row count metric. Refer to Anomaly score checks for details.
Anomaly score checks require a minimum of four data points (four scans at stable intervals) to establish a baseline against which to gauge anomalies. If you do not see check results immediately, allow Soda Library to accumulate the necessary data points.
Schema checks: This check monitors schema changes in datasets, including column addition, deletion, data type changes, and index changes. By default, this automated check results in a failure if a column is deleted, its type changes, or its index changes; it results in a warning if a column is added. Refer to Schema checks for details.
Schema checks require a minimum of one data point to use as a baseline against which to gauge schema changes. If you do not see check results immediately, wait until after you have scanned the dataset twice.
These types of checks require a Soda Cloud account. Soda Library pushes check results to your account where Soda Cloud stores all the previously-measured, historic values for your checks in the Cloud Metric Store. SodaCL can then use these stored values to establish a relative state against which to evaluate future schema and anomaly score checks.
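The default outcomes of the automated schema check described above can be illustrated with a short Python sketch. This is not Soda's implementation: the function name `classify_schema_changes` and the `{column: data_type}` dictionaries are hypothetical, and index changes are left out for brevity. It only shows the rule that a deleted column or a changed type results in a failure, while an added column results in a warning.

```python
from typing import Dict, List, Tuple


def classify_schema_changes(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[str, List[str]]:
    """Compare two {column: data_type} mappings and return (outcome, reasons).

    Hypothetical helper for illustration only; it mirrors the documented
    defaults (fail on deletion or type change, warn on addition).
    """
    failures: List[str] = []
    warnings: List[str] = []

    # Columns from the baseline that disappeared or changed type cause a failure.
    for column, old_type in previous.items():
        if column not in current:
            failures.append(f"column deleted: {column}")
        elif current[column] != old_type:
            failures.append(
                f"type changed: {column} ({old_type} -> {current[column]})"
            )

    # Columns that are new relative to the baseline cause a warning.
    for column in current:
        if column not in previous:
            warnings.append(f"column added: {column}")

    if failures:
        return "fail", failures + warnings
    if warnings:
        return "warn", warnings
    return "pass", []
```

The first scan only records the baseline schema, which is why a result appears after the second scan at the earliest.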
Prerequisites
- You have signed up for a Soda Cloud account.
- You have Administrator rights within your organization’s Soda Cloud account.
- You, or an Administrator in your organization’s Soda Cloud account, has deployed a Soda Agent which enables you to connect to a data source in Soda Cloud.
To define automated monitoring checks, follow the guided steps to create a new data source. Reference the section below for how to define the checks themselves.
- You have installed a Soda Library package in your environment.
- You have configured Soda Library to connect to a data source using a configuration.yml file.
- You have created and connected a Soda Cloud account to Soda Library.
- You have installed Soda Scientific in the same directory or virtual environment in which you installed Soda Library; see instructions below.
Install Soda Scientific
To use automated monitoring, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library.
- Set up a virtual environment, as described in the Soda Library install documentation.
- Install Soda Library in your new virtual environment.
- Use the following command to install Soda Scientific.
pip install -i https://pypi.cloud.soda.io soda-scientific
List of Soda Scientific dependencies
- pandas<2.0.0
- wheel
- pydantic>=1.8.1,<2.0.0
- scipy>=1.8.0
- numpy>=1.23.3, <2.0.0
- inflection==0.5.1
- httpx>=0.18.1,<2.0.0
- PyYAML>=5.4.1,<7.0.0
- cython>=0.22
- prophet>=1.1.0,<2.0.0
Refer to Troubleshoot Soda Scientific installation for help with issues during installation.
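To confirm that the installation landed in the same environment as Soda Library, one option is to query the installed package metadata from Python. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are taken from the pip command and dependency list above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run this with the interpreter of the virtual environment you want to check.
    for name in ("soda-scientific", "prophet", "pandas"):
        version = installed_version(name)
        print(f"{name}: {version or 'not installed'}")
```

If `soda-scientific` shows as not installed, the active interpreter is probably not the one from the virtual environment in which you ran pip.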
Troubleshoot Soda Scientific installation
Installing Soda Scientific works on Linux, but you may encounter issues if you install it on macOS (particularly on machines with the M1 ARM-based processor) or on any other operating system. If that is the case, consider using one of the following alternative installation procedures.
Need help? Ask the team in the Soda community on Slack.
Install Soda Scientific Locally
- Set up a virtual environment, as described in the Soda Library install documentation.
- Install Soda Library in your new virtual environment.
- Use the following command to install Soda Scientific.
pip install -i https://pypi.cloud.soda.io soda-scientific
List of Soda Scientific dependencies
- pandas<2.0.0
- wheel
- pydantic>=1.8.1,<2.0.0
- scipy>=1.8.0
- numpy>=1.23.3, <2.0.0
- inflection==0.5.1
- httpx>=0.18.1,<2.0.0
- PyYAML>=5.4.1,<7.0.0
- cython>=0.22
- prophet>=1.1.0,<2.0.0
Use Docker to run Soda Library
Use Soda’s Docker image in which Soda Scientific is pre-installed.
- If you have not already done so, install Docker in your local environment.
- From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.
docker pull sodadata/soda-library:v1.0.3
- Verify the pull by running the following command.
docker run sodadata/soda-library:v1.0.3 --help
Output:
Usage: soda [OPTIONS] COMMAND [ARGS]...

  Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  ingest           Ingests test results from a different tool
  scan             Runs a scan
  suggest          Generates suggestions for a dataset
  test-connection  Tests a connection
  update-dro       Updates contents of a distribution reference file
When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
- When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
What does the scan command do?
- docker run ensures that the docker engine runs a specific image.
- -v mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to /sodacl inside of the docker container.
- sodadata/soda-library refers to the image that docker run must use.
- scan instructs Soda Library to execute a scan of your data.
- -d indicates the name of the data source to scan.
- -c specifies the filepath and name of the configuration YAML file.
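The parts of the scan command explained above can be assembled programmatically, which helps avoid forgetting the /sodacl/ prefix on the file paths. The following Python sketch is only an illustration: the function `build_scan_command` and its parameters are hypothetical, and it produces nothing beyond the docker run arguments already shown in this section.

```python
from typing import List, Optional


def build_scan_command(
    host_dir: str,
    data_source: str,
    configuration_file: str,
    checks_file: str,
    version: Optional[str] = None,
    mount_point: str = "/sodacl",
) -> List[str]:
    """Build the argument list for a Soda scan that runs in the Docker image.

    Hypothetical helper: file names are relative to host_dir, which is
    mounted at mount_point inside the container.
    """
    # Pin the image tag only when a version such as "v1.0.0" is given.
    image = "sodadata/soda-library" + (f":{version}" if version else "")
    return [
        "docker", "run",
        "-v", f"{host_dir}:{mount_point}",            # mount the SodaCL files
        image,
        "scan",
        "-d", data_source,                            # data source to scan
        "-c", f"{mount_point}/{configuration_file}",  # configuration YAML
        f"{mount_point}/{checks_file}",               # checks YAML
    ]
```

Pass the resulting list to `subprocess.run` to execute it, or join it with spaces to print the equivalent shell command.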
Error: Mounts denied
If you encounter the following error, follow the procedure below.
docker: Error response from daemon: Mounts denied:
The path /soda-library-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.
You need to give Docker permission to access your configuration.yml and checks.yml files in your environment. To do so:
- Access your Docker Dashboard, then select Preferences (gear symbol).
- Select Resources, then follow the Docker instructions to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.
- Click Apply & Restart, then repeat steps 2 - 4 above.
Error: Configuration path does not exist
If you encounter the following error, double check the syntax of the scan command in step 4 above.
- Be sure to prepend /sodacl/ to both the configuration.yml filepath and the checks.yml filepath.
- Be sure to mount your files into the container by including the -v option. For example, -v /Users/MyName/soda_project:/sodacl.
Soda Library 1.0.x
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Define an automated monitoring check
In the context of SodaCL check types, automated monitoring checks are unique. They employ the anomaly score and schema checks, but offer limited syntax variation: the only mutable parts specify the datasets to which Soda Library automatically applies the anomaly score and schema checks.
The example check below uses a wildcard character (%) to specify that Soda Library executes automated monitoring checks against all datasets with names that begin with prod, and does not execute the checks against any dataset with a name that begins with test.
automated monitoring:
  datasets:
    - include prod%
    - exclude test%
You can also specify individual datasets to include or exclude, as in the following example.
automated monitoring:
  datasets:
    - include orders
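To make the include and exclude semantics concrete, the Python sketch below selects dataset names using % as a wildcard for any sequence of characters. It is an approximation for illustration, not Soda Library's own matching logic: the function `select_datasets` is hypothetical, and the sketch assumes case-sensitive matching and that an exclude rule overrides an include rule.

```python
import re
from typing import Iterable, List


def _to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern with % wildcards into a compiled regular expression."""
    # Escape the literal parts, then let each % match any run of characters.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("%")))


def select_datasets(
    names: Iterable[str], include: Iterable[str], exclude: Iterable[str]
) -> List[str]:
    """Return the names that match an include pattern and no exclude pattern.

    Assumption for this sketch: exclusion wins over inclusion.
    """
    include_res = [_to_regex(p) for p in include]
    exclude_res = [_to_regex(p) for p in exclude]
    selected = []
    for name in names:
        included = any(r.fullmatch(name) for r in include_res)
        excluded = any(r.fullmatch(name) for r in exclude_res)
        if included and not excluded:
            selected.append(name)
    return selected
```

For example, with the datasets prod_orders, prod_customers, test_orders, and staging, the patterns include prod% and exclude test% select only prod_orders and prod_customers.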
Scan results in Soda Cloud
To review the check results for automated monitoring checks in Soda Cloud, navigate to the Checks dashboard to see the automated monitoring check results with an INSIGHT tag.
Optional check configurations
Supported | Configuration | Documentation |
---|---|---|
 | Define a name for an automated monitoring check. | - |
 | Define alert configurations to specify warn and fail thresholds. | - |
 | Apply an in-check filter to return results for a specific portion of the data in your dataset. | - |
 | Use quotes when identifying dataset names. | - |
✓ | Use wildcard characters (%) with dataset names in the check; see example. | - |
 | Use for each to apply anomaly score checks to multiple datasets in one scan. | - |
 | Apply a dataset filter to partition data during a scan. | - |
Example with wildcards
automated monitoring:
  datasets:
    - include prod%
    - exclude test%
Go further
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
- Use a freshness check to gauge how recently your data was captured.
- Use reference checks to compare the values of one column to another.
Documentation always applies to the latest version of Soda products