Quick start for Soda Core
Last modified on 26-Jan-23
Use your command-line interface to connect Soda Core to a data source, create and examine the checks that surface “bad” data in a dataset, then run your first scan in a few minutes.
After you run your scan from the command line, consider going further by signing up for a free trial account in Soda Cloud, the web application that offers data quality visualizations and much more.
Tutorial prerequisites
Install Soda Core
Connect Soda Core to a data source
Write a check and run a scan
(Optional) Connect Soda Core to Soda Cloud
Tutorial prerequisites
This tutorial references a macOS development environment with a PostgreSQL data source. You need the following:
- Python 3.8 or greater
- pip 21.0 or greater
- access details and credentials for a data source
- a code editor such as Sublime or Visual Studio Code
Install Soda Core
- Best practice dictates that you install the Soda Core CLI using a virtual environment. In your command-line interface, create a virtual environment in the .venv directory, then activate it and update the version of pip.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
- Execute the following command to install Soda Core in your virtual environment.
pip install soda-core-postgres
- Validate the installation using the soda command. The command-line output is similar to the following.
Usage: soda [OPTIONS] COMMAND [ARGS]...

  Soda Core CLI version 3.0.bx

Options:
  --help  Show this message and exit.

Commands:
  scan        runs a scan
  update-dro  updates a distribution reference file
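The validation step above can also be done from Python, which is handy in a setup script. The following is a minimal sketch using only the standard library, not a Soda Core feature; the helper names `find_cli` and `installed_version` are hypothetical, and the sketch assumes it runs with the interpreter of the activated virtual environment.

```python
import shutil
from importlib import metadata


def find_cli(name="soda"):
    """Return the full path of the CLI executable on PATH, or None."""
    return shutil.which(name)


def installed_version(package="soda-core-postgres"):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("soda executable:", find_cli() or "not found on PATH")
    print("soda-core-postgres:", installed_version() or "not installed")
```

If either line reports that something is missing, check that the virtual environment is activated and rerun the pip install command.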
Connect Soda Core to a data source
There are several other install packages for Soda Core that correspond to different data sources. This tutorial references a macOS development environment with a PostgreSQL data source.
- In your command-line interface, create a Soda tutorial project directory in your local environment, then navigate to the directory.
mkdir soda_tutorial
cd soda_tutorial
- Create a new file called configuration.yml.
- Open the configuration.yml file in a code editor, then copy and paste the following connection details into the file. Replace the values for each of the fields with your own data source-specific details, then save the file.
data_source my_database_name:
  type: postgres
  connection:
    host: soda-temp-demo
    port: '5432'
    username: sodademo
    password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
- Replace my_database_name with the name of your database; replace postgres with the type of data source to which you are connecting.
- Note that connection: is a header, not a field.
- All values are required.
- Consider using system variables to securely store the values of your username and password. Refer to Configure Soda Core for details.
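The `${POSTGRES_PASSWORD}` placeholder in the configuration above stands for a value held in an environment variable rather than in the file itself. As a rough illustration of that idea only (this is not Soda Core's own implementation), the sketch below substitutes the placeholder with Python's `string.Template`. The `render_config` helper is hypothetical, and the configuration text is abbreviated.

```python
import os
from string import Template

# Abbreviated copy of the tutorial configuration; only the placeholder matters here.
CONFIG_TEMPLATE = """\
data_source my_database_name:
  type: postgres
  connection:
    username: sodademo
    password: ${POSTGRES_PASSWORD}
"""


def render_config(template_text, env=None):
    """Replace ${NAME} placeholders with values from the environment.

    Raises KeyError when a referenced variable is not set, so a missing
    secret is caught early instead of being passed along as an empty value.
    """
    env = os.environ if env is None else env
    return Template(template_text).substitute(env)
```

In a shell you would set the variable first, for example with `export POSTGRES_PASSWORD=...`, so that the secret never appears in configuration.yml.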
Write a check and run a scan
- Using Finder or Terminal, create another file in the soda_tutorial directory called checks.yml. A Soda Check is a test that Soda Core performs when it scans a dataset in your data source. The checks.yml file stores the Soda Checks you write using the Soda Checks Language (SodaCL).
- Open the checks.yml file in a code editor, then copy and paste the following Soda Check into the file. Replace the value for my_dataset with the name of a dataset in your data source.
This simple check validates that the dataset contains more than zero rows, which is to say, that it is not empty.
checks for my_dataset:
  - row_count > 0
- Save the changes to the checks.yml file, then, in Terminal, use the following command to run a scan. As input, the command requires:
  - the name of the data source to scan; replace the value for my_database_name with the name of your PostgreSQL database
  - the filepath and name of the configuration.yml file
  - the filepath and name of the checks.yml file
Command:
soda scan -d my_database_name -c configuration.yml checks.yml
Output:
Soda Core 3.0.xx
Scan summary:
1/1 check PASSED:
    my_dataset in my_database_name
      row_count > 0 [PASSED]
All is good. No failures. No warnings. No errors.
- The CLI output indicates that the check passed, confirming that the dataset is not empty. (Optional) To see an example of scan output with a failed check, open the checks.yml file and change the check to read row_count < 5, then save the file.
- (Optional) Run the same scan command to see different scan results in the CLI.
Command:
soda scan -d my_database_name -c configuration.yml checks.yml
Output:
Soda Core 3.0.xx
Scan summary:
1/1 check FAILED:
    my_dataset in my_database_name
      row_count < 5 [FAILED]
        check_value: 1329
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
- (Optional) To see more detail in the scan results output in the CLI, add the -V option to the scan command to return a verbose version of the output.
Command:
soda scan -d my_database_name -c configuration.yml -V checks.yml
Output:
Soda Core 3.0.xx
Reading configuration file "/Users/username/.soda/configuration.yml"
Reading SodaCL file "checks.yml"
Scan execution starts
Query aws_postgres_retail.orders.aggregation[0]:
SELECT COUNT(*) FROM orders
Scan summary:
1/1 query OK
  my_database_name.my_dataset.aggregation[0] [OK] 0:00:00.285771
1/1 check FAILED:
    my_dataset in my_database_name
      row_count < 5 [FAILED]
        check_value: 1329
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
- (Optional) If you like, adjust or add more checks to the checks.yml file to further explore the things that SodaCL can do.
- To exit the virtual environment in your command-line interface, type deactivate then press enter.
OR
Continue to the next section to connect Soda Core to a Soda Cloud account.
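The scan commands in this section differ only in the optional -V flag, so they are easy to assemble in a script. The following is a minimal sketch that wraps the CLI with Python's `subprocess` module. The `build_scan_command` and `run_scan` helpers are hypothetical names, and the sketch assumes the soda executable is on PATH, for example inside the activated virtual environment.

```python
import subprocess


def build_scan_command(data_source, configuration_path, checks_path, verbose=False):
    """Assemble the argument list for the `soda scan` command used in this tutorial."""
    command = ["soda", "scan", "-d", data_source, "-c", configuration_path]
    if verbose:
        command.append("-V")  # verbose output, as in the optional step
    command.append(checks_path)
    return command


def run_scan(data_source, configuration_path, checks_path, verbose=False):
    """Run the scan and return the process exit code.

    The meaning of each non-zero exit code is not covered in this tutorial;
    confirm it in the Soda documentation before relying on it.
    """
    command = build_scan_command(data_source, configuration_path, checks_path, verbose)
    completed = subprocess.run(command, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without a data source.
    print(" ".join(build_scan_command("my_database_name", "configuration.yml", "checks.yml")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file path contains spaces.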
(Optional) Connect Soda Core to Soda Cloud
Though you can use Soda Core as a standalone CLI tool to monitor data quality, you may wish to connect to the Soda Cloud web application that vastly enriches the data quality monitoring experience.
Beyond increasing the observability of your data, Soda Cloud enables you to automatically detect anomalies, and view samples of the rows that failed a test during a scan. Integrate Soda Cloud with your Slack workspace to collaborate with your team on data monitoring.
Soda Core uses an API to connect to Soda Cloud. To use the API, you must generate API keys in your Soda Cloud account, then add them to the configuration YAML file. When it runs a scan, Soda Core pushes the test results to Soda Cloud.
- If you have not already done so, create a Soda Cloud account at cloud.soda.io.
- In a code editor, open your configuration.yml file, then add the soda_cloud syntax to the file, as in the example below.
data_source my_database_name:
  type: postgres
  connection:
    host: soda-temp-demo
    port: '5432'
    username: sodademo
    password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
soda_cloud:
  host: cloud.soda.io
  api_key_id:
  api_key_secret:
- Save the configuration.yml file.
- In Soda Cloud, navigate to your avatar > Profile > API Keys, then click the plus icon to generate new API keys.
  - Copy the API Key ID, then paste it into the configuration.yml file as the value for api_key_id.
  - Copy the API Key Secret, then paste it into the configuration.yml file as the value for api_key_secret.
  You may wish to securely store the values for the API keys as system variables.
- Save the changes to the configuration.yml file. Close the Create API Key dialog box in your Soda Cloud account.
- From the command line, in your soda_tutorial directory, use Soda Core to scan the datasets in your data source again.
Command:
soda scan -d my_database_name -c configuration.yml checks.yml
Output:
Soda Core 3.0.xx
Scan summary:
1/1 check FAILED:
    my_dataset in my_database_name
      row_count < 5 [FAILED]
        check_value: 1329
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Sending results to Soda Cloud
- Go to your Soda Cloud account in your browser and navigate to the Checks dashboard. Review the results of your scan in Check Results.
- Navigate to the Datasets dashboard, then click to select the name of your dataset to review statistics and metadata about the dataset.
- Explore Soda Cloud!
  - integrate your Slack workspace to receive notifications of failed checks and collaborate on data quality investigations
  - set up or modify notifications for the checks in your account (Check > Check Results > stacked dots > Edit Check)
  - open and track data quality incidents and collaborate to resolve them with your team in Slack
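Where the steps above suggest storing the API keys as system variables, a small preflight check can confirm that they are set before a scan runs. The following is a minimal sketch. The variable names `SODA_CLOUD_API_KEY_ID` and `SODA_CLOUD_API_KEY_SECRET` are assumptions, since this tutorial does not name them, and `missing_variables` is a hypothetical helper.

```python
import os

# Hypothetical variable names; match them to whatever your configuration.yml references.
REQUIRED_VARIABLES = ("SODA_CLOUD_API_KEY_ID", "SODA_CLOUD_API_KEY_SECRET")


def missing_variables(names=REQUIRED_VARIABLES, env=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    print("Missing variables:", ", ".join(missing) if missing else "none")
```

Presumably the keys would then be referenced in configuration.yml with the same `${NAME}` placeholder style the tutorial uses for the PostgreSQL password; refer to Configure Soda Core to confirm.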
To exit the virtual environment in your command-line interface, type deactivate then press enter.
Go further
- Consider completing the Quick start for SodaCL.
- Explore the built-in metrics and checks you can use with SodaCL.
- Set up programmatic scans to automate data quality monitoring.
- Need help? Join the Soda community on Slack.