
Configure Soda Library

Last modified on 27-Sep-23

Soda Core, the free, open-source Python library and CLI tool that Soda Library extends, continues to exist as an OSS project on GitHub, including all Soda Core documentation.

Migrate to Soda Library to connect to Soda Cloud and access all the newest Soda features.

After you install Soda Library, you must create a configuration.yml file that provides the details Soda Library needs to:

  • connect to your data source (except Apache Spark DataFrames, which do not use a configuration YAML file)
  • connect to your Soda Cloud account with API keys; Soda Library requires API keys to validate licensing or trial status and to run scans for data quality.

Alternatively, you can provide data source connection configurations in the context of a programmatic scan, if you wish.
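As a rough illustration of the programmatic alternative, the following Python sketch passes the connection configuration to a scan as an inline YAML string rather than a file. It assumes the `soda.scan.Scan` class and its `set_data_source_name`, `add_configuration_yaml_str`, `add_sodacl_yaml_str`, and `execute` methods behave as they do in Soda Core; the helper names `build_configuration` and `run_programmatic_scan` and all connection values are placeholders, so check the programmatic scan documentation for your version before relying on it.

```python
def build_configuration(data_source: str, host: str, password_var: str) -> str:
    """Return an inline data source configuration as YAML text (placeholder values)."""
    lines = [
        f"data_source {data_source}:",
        "  type: postgres",
        "  connection:",
        f"    host: {host}",
        "    username: postgres",
        # Reference a system variable instead of embedding the password.
        f"    password: ${{{password_var}}}",
        "  database: postgres",
        "  schema: public",
    ]
    return "\n".join(lines) + "\n"


def run_programmatic_scan(data_source: str, configuration: str, checks: str) -> int:
    """Run a scan from in-memory YAML strings (assumes Soda Library is installed)."""
    # Imported lazily so this module still loads where Soda Library is absent.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_str(configuration)
    scan.add_sodacl_yaml_str(checks)
    return scan.execute()  # assumed to return an integer exit code
```

A complete configuration string would also need the soda_cloud keys, since Soda Library requires them to run scans.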

Configure and run Soda Library
Provide credentials as system variables
Configure the same scan to run in multiple environments
Disable failed rows samples for specific columns
Disable failed rows samples for individual checks
Go further

Configure and run Soda Library

Consider following the Take a sip of Soda quick start that guides you through the steps to configure Soda Library and run a scan of sample data.

  1. Soda Library connects with Spark DataFrames in a unique way, using programmatic scans.
  2. Create a configuration.yml file. This file stores connection details for your data sources and your Soda Cloud account. Copy the connection syntax from the connection configuration details for your data source, paste it into your file, then adjust the values to correspond with your data source’s details, as in the following example for PostgreSQL.
    You can use system variables to pass sensitive values, if you wish.
     data_source adventureworks:
       type: postgres
       connection:
         host: localhost
         username: postgres
         password: secret
       database: postgres
       schema: public
    
  3. In a browser, navigate to cloud.soda.io/signup to create a new Soda account. If you already have a Soda account, log in.
  4. Navigate to your avatar > Profile, then access the API keys tab. Click the plus icon to generate new API keys. Copy+paste the soda_cloud configuration syntax, including the API keys, into the configuration.yml file, as in the example below.
    • Do not nest the soda_cloud configuration under the data_source configuration.
    • For host, use cloud.soda.io for EU region; use cloud.us.soda.io for USA region, according to your selection when you created your Soda Cloud account.
    • Optionally, provide a value for the scheme property to indicate which scheme to use to initialize the URI instance. If you do not explicitly include a scheme property, Soda uses the default https.
         soda_cloud:
           host: cloud.soda.io
           api_key_id: 2e0ba0cb-**7b
           api_key_secret: 5wdx**aGuRg
           scheme:
      
  5. Save the configuration.yml file, then, in the same directory, create another new YAML file named checks.yml.
  6. A Soda Check is a test that Soda Library performs when it scans a dataset in your data source. The checks YAML file stores the Soda Checks you write using SodaCL, like the check below that ensures a dataset is not empty. Copy+paste the following basic check syntax in your file, then adjust the value for dataset_name to correspond with the name of one of the datasets in your data source.
    checks for dataset_name:
      - row_count > 0
    
  7. Save the changes to the checks.yml file.
  8. Use the following command to run a scan of the data in your data source. Replace the value for my_datasource with the name of the data source you added to your configuration.yml file. Read more about scans.
    soda scan -d my_datasource -c configuration.yml checks.yml
    

    Command-line Output:

    Soda Library 1.0.x
    Scan summary:
    1/1 check PASSED: 
     dim_customer in adventureworks
       row_count > 0 [PASSED]
    All is good. No failures. No warnings. No errors.
    Sending results to Soda Cloud
    Soda Cloud Trace: 67592***474
    
  9. Access your Soda Cloud account in your browser and navigate to Checks to review the same scan output that Soda Library printed in the command line.
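
The file layout and command from the steps above can be sketched in Python as follows. The helper names `write_scan_files` and `scan_command` are hypothetical, the `${...}` values are placeholders for system variables, and the sketch only prepares the `soda scan` invocation; it does not run it.

```python
from pathlib import Path

# soda_cloud sits at the top level, not nested under the data source.
CONFIGURATION = """\
data_source adventureworks:
  type: postgres
  connection:
    host: localhost
    username: postgres
    password: ${POSTGRES_PASSWORD}
  database: postgres
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${API_KEY}
  api_key_secret: ${API_SECRET}
"""

CHECKS = """\
checks for dim_customer:
  - row_count > 0
"""


def write_scan_files(directory: Path) -> tuple[Path, Path]:
    """Write configuration.yml and checks.yml side by side and return their paths."""
    directory.mkdir(parents=True, exist_ok=True)
    configuration = directory / "configuration.yml"
    checks = directory / "checks.yml"
    configuration.write_text(CONFIGURATION, encoding="utf-8")
    checks.write_text(CHECKS, encoding="utf-8")
    return configuration, checks


def scan_command(data_source: str, configuration: Path, checks: Path) -> list[str]:
    """Build the argument list for the soda scan command shown in step 8."""
    return ["soda", "scan", "-d", data_source, "-c", str(configuration), str(checks)]
```

With Soda Library installed, the returned list could be passed to `subprocess.run` to start the scan.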

Provide credentials as system variables

If you wish, you can provide data source login credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. A system variable persists only for as long as the terminal session in which you created it remains open. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.

For connection configuration values

  1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your password.
    export POSTGRES_PASSWORD=1234
    
  2. Test that the system retrieves the value that you set by running an echo command.
    echo $POSTGRES_PASSWORD
    
  3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
    data_source my_database_name:
      type: postgres
      connection:
        host: soda-temp-demo
        port: '5432'
        username: sodademo
        password: ${POSTGRES_PASSWORD}
        database: postgres
        schema: public
    
  4. Save the configuration YAML file, then run a scan to confirm that Soda Library connects to your data source without issue.
    soda scan -d your_datasource -c configuration.yml checks.yml
    

For API key values

  1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your API key ID; define a second variable, such as API_SECRET, in the same way for the API key secret.
    export API_KEY=1234
    
  2. Test that the system retrieves the value that you set by running an echo command.
    echo $API_KEY
    
  3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
     data_source my_database_name:
       type: postgres
       connection:
         host: soda-temp-demo
         port: '5432'
         username: sodademo
         password: ${POSTGRES_PASSWORD}
         database: postgres
         schema: public
        
     soda_cloud:
       host: cloud.soda.io
       api_key_id: ${API_KEY}
       api_key_secret: ${API_SECRET}
    
  4. Save the configuration YAML file, then run a scan to confirm that Soda Library connects to Soda Cloud without issue.
    soda scan -d your_datasource -c configuration.yml checks.yml
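
A reference to a variable that was never exported is easy to miss, so it can help to check the references before you run a scan. The sketch below finds every `${NAME}` reference in a configuration string and reports the names that are absent from the environment. It is an illustrative helper, not a description of how Soda Library resolves variables internally; the function names and the reference pattern are assumptions based on the examples above.

```python
import os
import re
from typing import Mapping, Optional

# Matches references such as ${POSTGRES_PASSWORD}; assumes plain shell-style names.
_REFERENCE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def referenced_variables(configuration: str) -> list[str]:
    """Return the variable names referenced in the text, in order, without repeats."""
    seen: list[str] = []
    for name in _REFERENCE.findall(configuration):
        if name not in seen:
            seen.append(name)
    return seen


def missing_variables(
    configuration: str, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return referenced names that are not set in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in referenced_variables(configuration) if name not in env]


EXAMPLE = """\
data_source my_database_name:
  type: postgres
  connection:
    host: soda-temp-demo
    password: ${POSTGRES_PASSWORD}

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${API_KEY}
  api_key_secret: ${API_SECRET}
"""
```

Calling `missing_variables(EXAMPLE)` in the terminal session you intend to scan from lists any variables that still need an `export`.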
    

Configure the same scan to run in multiple environments

When you want to run a scan that executes the same checks on different environments or schemas, such as development, production, and staging, you must apply the following configurations so that Soda Cloud does not merge the check results from the different environments into a single, confusing set of results.

  1. In your configuration.yml file, provide separate connection configurations for each environment, as in the following example.
    data_source nyc_dev:
      type: postgres
      connection:
        host: host
        port: '5432'
        username: ${POSTGRES_USER}
        password: ${POSTGRES_PASSWORD}
        database: postgres
        schema: public
    data_source nyc_prod:
      type: postgres
      connection:
        host: host
        port: '5432'
        username: ${POSTGRES_USER}
        password: ${POSTGRES_PASSWORD}
        database: postgres
        schema: public
    
  2. Provide a scan definition name at scan time using the -s option. The scan definition helps Soda Cloud to distinguish different scan contexts and therefore plays a crucial role when the checks.yml file names and the checks themselves are the same.
    # for NYC data source for dev
    soda scan -d nyc_dev -c configuration.yml -s nyc_a checks.yml
    # for NYC data source for prod
    soda scan -d nyc_prod -c configuration.yml -s nyc_b checks.yml
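
When the number of environments grows, the per-environment commands above can be generated rather than typed. The following Python sketch builds one `soda scan` argument list per data source and refuses to proceed if two environments share a scan definition name. The function name `scan_commands` and the mapping of data source names to scan definition names are hypothetical; only the `-d`, `-c`, and `-s` options come from the commands above.

```python
def scan_commands(
    environments: dict[str, str],
    configuration: str = "configuration.yml",
    checks: str = "checks.yml",
) -> list[list[str]]:
    """Build one soda scan command per data source, each with its own -s name.

    environments maps a data source name to its scan definition name.
    """
    definitions = list(environments.values())
    # A shared scan definition name would defeat the purpose of the -s option.
    if len(set(definitions)) != len(definitions):
        raise ValueError("each environment needs a distinct scan definition name")
    return [
        ["soda", "scan", "-d", data_source, "-c", configuration, "-s", definition, checks]
        for data_source, definition in environments.items()
    ]


commands = scan_commands({"nyc_dev": "nyc_a", "nyc_prod": "nyc_b"})
```

Each inner list can be handed to `subprocess.run` in turn, for example from a scheduler or CI job.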
    

See also: Troubleshoot missing check results
See also: Add a check identity

Disable failed rows samples for specific columns

For checks that implicitly or explicitly collect failed rows samples, you can add a configuration to your configuration YAML file to prevent Soda from collecting failed rows samples from specific columns that contain sensitive data.

Refer to Disable failed rows sampling for specific columns.

Disable failed rows samples for individual checks

For checks that implicitly or explicitly collect failed rows samples, you can set the samples limit to 0 to prevent Soda from collecting failed rows samples (and sending the samples to Soda Cloud) for an individual check, as in the following example.

checks for dim_customer:
  - missing_percent(email_address) < 50:
      samples limit: 0


Go further



Documentation always applies to the latest version of Soda products