Configure Soda Core
Last modified on 31-May-23
After you install Soda Core, you must create a configuration.yml file to provide the details that Soda Core needs to connect to your data source (except Apache Spark DataFrames, which do not use a configuration YAML file).
Alternatively, you can provide data source connection configurations in the context of a programmatic scan, if you wish.
Configuration instructions
Provide credentials as system variables
Configure the same scan to run in multiple environments
Disable failed rows samples for specific columns
Disable failed rows samples for individual checks
Go further
Configuration instructions
Consider following the Take a sip of Soda quick start that guides you through the steps to configure Soda Core and run a scan of your data.
- Soda Core connects with Spark DataFrames in a unique way, using programmatic scans.
  - If you are using Spark DataFrames, follow the configuration details in Connect to Apache Spark DataFrames.
  - If you are not using Spark DataFrames, continue to the next step.
- Create a configuration.yml file. This file stores connection details for your data sources. Use the data source-specific connection configurations listed below to copy+paste the connection syntax into your file, then adjust the values to correspond with your data source’s details. You can use system variables to pass sensitive values, if you wish. Access connection details in the Connect a data source section of the Soda documentation.
- Save the configuration.yml file, then create another new YAML file named checks.yml.
- A Soda Check is a test that Soda Core performs when it scans a dataset in your data source. The checks YAML file stores the Soda Checks you write using SodaCL. Copy+paste the following basic check syntax in your file, then adjust the value for dataset_name to correspond with the name of one of the datasets in your data source.

    checks for dataset_name:
      - row_count > 0

- Save the changes to the checks.yml file.
- Next: run a scan of the data in your data source.
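The steps above can be sketched as a few shell commands. This is a minimal sketch that assumes a PostgreSQL data source and reuses the placeholder values found elsewhere on this page (my_database_name, soda-temp-demo, sodademo, dim_customer); the exact nesting of connection properties is an assumption and may differ for your data source, so check the connection reference for your data source type.

```shell
# Work in a scratch directory so that existing files are not overwritten.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# The quoted heredoc delimiter ('EOF') writes ${POSTGRES_PASSWORD} literally
# instead of letting the shell expand it while creating the file.
cat > configuration.yml <<'EOF'
data_source my_database_name:
  type: postgres
  connection:
    host: soda-temp-demo
    port: '5432'
    username: sodademo
    password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public
EOF

# A basic check: the dataset (dim_customer is a placeholder) must not be empty.
cat > checks.yml <<'EOF'
checks for dim_customer:
  - row_count > 0
EOF

# Next step, not executed here because it needs Soda Core and a reachable data source:
# soda scan -d my_database_name -c configuration.yml checks.yml
```

Quoting the heredoc delimiter matters here: without the quotes, the shell would substitute the password into the file at creation time, which defeats the purpose of referencing a variable.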
Provide credentials as system variables
If you wish, you can provide data source login credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only for as long as you have the terminal session open in which you created the variable. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.
- From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your password.

    export POSTGRES_PASSWORD=1234

- Test that the system retrieves the value that you set by running an echo command.

    echo $POSTGRES_PASSWORD

- In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.

    data_source my_database_name:
      type: postgres
      connection:
        host: soda-temp-demo
        port: '5432'
        username: sodademo
        password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public

- Save the configuration YAML file, then run a scan to confirm that Soda Core connects to your data source without issue.

    soda scan -d your_datasource -c configuration.yml checks.yml
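Because system variables last only for the current terminal session, it can help to confirm that a variable is set before you run a scan. The following is a minimal sketch; require_var is a hypothetical helper written for this example, not part of Soda Core.

```shell
# Hypothetical helper: return non-zero if the named variable is unset or empty
# in the environment that child processes (such as a scan) would inherit.
require_var() {
  if [ -z "$(printenv "$1")" ]; then
    echo "Missing environment variable: $1" >&2
    return 1
  fi
}

# Set the variable for this session, as in the steps above.
export POSTGRES_PASSWORD=1234

# Check it before scanning; the scan itself is left as a comment because it
# needs Soda Core and a reachable data source.
if require_var POSTGRES_PASSWORD; then
  echo "POSTGRES_PASSWORD is set"
  # soda scan -d your_datasource -c configuration.yml checks.yml
fi
```

The helper uses printenv, which reports only exported variables, so it reflects what a child process would see rather than what is merely defined in the current shell.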
Configure the same scan to run in multiple environments
When you want to run a scan that executes the same checks on different environments or schemas, such as development, production, and staging, you must apply the following configurations to ensure that Soda Cloud does not merge the check results from scans of different environments into one indistinguishable set of results.
- Ensure that you are using Soda Core 3.0.7 or later. See instructions for upgrading.
- In your configuration.yml file, provide separate connection configurations for each environment, as in the following example.

    data_source nyc_dev:
      type: postgres
      connection:
        host: host
        port: '5432'
        username: ${POSTGRES_USER}
        password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public
    data_source nyc_prod:
      type: postgres
      connection:
        host: host
        port: '5432'
        username: ${POSTGRES_USER}
        password: ${POSTGRES_PASSWORD}
      database: postgres
      schema: public

- Provide a scan definition name at scan time using the -s option. The scan definition helps Soda Cloud to distinguish different scan contexts and therefore plays a crucial role when the checks.yml file names and the checks themselves are the same.

    # for NYC data source for dev
    soda scan -d nyc_dev -c configuration.yml -s nyc_a checks.yml
    # for NYC data source for prod
    soda scan -d nyc_prod -c configuration.yml -s nyc_b checks.yml
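If you run the same checks against several environments from a script, one way to keep the scan definition names distinct is to derive them from the environment name. The following is a sketch only: build_scan_command is a hypothetical helper, and the nyc_<environment>_scan naming pattern is an assumption for illustration. It prints the commands as a dry run instead of executing them.

```shell
# Hypothetical helper: print the scan command for one environment, pairing the
# data source nyc_<env> with a scan definition name that is unique to it.
build_scan_command() {
  env_name="$1"   # for example: dev or prod
  echo "soda scan -d nyc_${env_name} -c configuration.yml -s nyc_${env_name}_scan checks.yml"
}

# Dry run: list the command for each environment.
for env_name in dev prod; do
  build_scan_command "$env_name"
done
```

Printing the commands first makes it easy to confirm that no two environments share a scan definition name before any results reach Soda Cloud.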
See also: Troubleshoot missing check results
See also: Add a check identity
Disable failed rows samples for specific columns
For checks that implicitly or explicitly collect failed rows samples, you can add a configuration to your configuration YAML file to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data.
Refer to Disable failed rows sampling for specific columns.
Disable failed rows samples for individual checks
For checks that implicitly or explicitly collect failed rows samples, you can set the samples limit to 0 to prevent Soda from collecting failed rows samples (and sending the samples to Soda Cloud, if you have connected it to Soda Core) for an individual check, as in the following example.

    checks for dim_customer:
      - missing_percent(email_address) < 50:
          samples limit: 0
Go further
- Next: Run a scan of the data in your data source.
- Consider completing the Quick start for SodaCL to learn how to write more checks for data quality.
- (Optional) Connect Soda Core to a Soda Cloud account.
- Need help? Join the Soda community on Slack.