Connect Soda Core to Soda Cloud

Last modified on 31-May-23

To use all the features and functionality that Soda has to offer, you can install and configure the Soda command-line tool, then connect it to your Soda account.

Soda Core uses an API to connect to Soda Cloud. To use the API, you must generate API keys in your Soda Cloud account, then add them to the configuration YAML file that Soda Core uses to connect to your data sources. Note that the API keys you create do not expire.

Prerequisites
Connect
Connect Soda Core for SparkDF to Soda Cloud
Provide credentials as system variables
Go further

Prerequisites

  • You have installed and configured Soda Core and run at least one scan of your data.

Connect

  1. If you have not already done so, create a Soda Cloud account at https://cloud.soda.io/signup. Select a region for your account based on where you wish to store Soda Cloud data.
  2. Open your configuration.yml file in a text editor, then add the following to the file, leaving the values blank for now.
    • Be sure to add the syntax for soda_cloud at the root level of the YAML file, not nested under any other data_source syntax.
    • Consider creating system or environment variables for the values of your API key and secret; see Provide credentials as system variables.
      soda_cloud:
        # For host, use cloud.soda.io for EU region, use cloud.us.soda.io for USA region
        host: cloud.soda.io
        api_key_id:
        api_key_secret:
        # Optional
        scheme:
      
  3. In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.
    • Copy the API Key ID, then paste it into the configuration.yml as the value for api_key_id.
    • Copy the API Key Secret, then paste it into the configuration.yml as the value for api_key_secret.
    • Optionally, provide a value for the scheme property to indicate which scheme to use to initialize the URI instance. If you do not explicitly include a scheme property, Soda uses the default https.
  4. Save the changes to the configuration.yml file. Close the Create API Key dialog box in Soda Cloud.
  5. From the command line, use Soda Core to scan the datasets in your data source again.
    soda scan -d your_datasource_name -c configuration.yml checks.yml
    
  6. Navigate to your Soda Cloud account in your browser, then review the results of your latest scan in Check Results.
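
As a rough pre-flight check before running the scan, the sketch below inspects the text of a configuration file and reports whether soda_cloud sits at the root level with non-empty host, api_key_id, and api_key_secret values. It is a line-based heuristic that assumes a simple block-style YAML layout like the one in step 2. It is not a YAML parser, and the function name check_soda_cloud_block is hypothetical, not part of Soda Core.

```python
import re
from typing import List

REQUIRED_KEYS = ("host", "api_key_id", "api_key_secret")


def check_soda_cloud_block(config_text: str) -> List[str]:
    """Return a list of problems; an empty list means the heuristic checks passed.

    Hypothetical helper: a line-based check, not a full YAML parser.
    """
    lines = config_text.splitlines()
    # A root-level key starts in column 0; a nested one is indented.
    root = [i for i, line in enumerate(lines) if re.match(r"soda_cloud:\s*(#.*)?$", line)]
    nested = [i for i, line in enumerate(lines) if re.match(r"\s+soda_cloud:", line)]
    if not root:
        if nested:
            return ["soda_cloud is nested; move it to the root level"]
        return ["soda_cloud block not found"]

    # Collect the indented key/value lines that follow the root-level key.
    block = {}
    for line in lines[root[0] + 1:]:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith((" ", "\t")):
            break  # next root-level key ends the block
        key, _, value = stripped.partition(":")
        block[key.strip()] = value.strip()

    return [f"{key} is missing or empty" for key in REQUIRED_KEYS if not block.get(key)]
```

To try it, pass the contents of your file, for example check_soda_cloud_block(open("configuration.yml").read()), and fix anything it reports before you run soda scan.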

Connect Soda Core for SparkDF to Soda Cloud

Unlike with other data sources, Soda Core does not require a configuration YAML file to run scans against Spark DataFrames. Soda Core for SparkDF works with programmatic Soda scans only.

Therefore, to connect to Soda Cloud, include the Soda Cloud API keys (see step 3, above) in your programmatic scan using either scan.add_configuration_yaml_file(file_path) or scan.add_configuration_yaml_str(config_string), as in the example below.

from pyspark.sql import SparkSession, types
from soda.scan import Scan

spark_session = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark_session.createDataFrame(
    data=[{"id": "1", "name": "John Doe"}],
    schema=types.StructType(
        [types.StructField("id", types.StringType()), types.StructField("name", types.StringType())]
    ),
)
df.createOrReplaceTempView("users")

scan = Scan()
scan.set_verbose(True)
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
scan.set_data_source_name("spark_df")
scan.add_configuration_yaml_file(file_path="sodacl_spark_df/configuration.yml")
scan.add_configuration_yaml_str(
    """
soda_cloud:
  api_key_id: "[key]"
  api_key_secret: "[secret]"
  host: cloud.soda.io
"""
)
scan.add_spark_session(spark_session)
scan.add_sodacl_yaml_file(file_path="sodacl_spark_df/checks.yml")
# ... all other scan methods in the standard programmatic scan ...
scan.execute()

# print(scan.get_all_checks_text())
print(scan.get_logs_text())
# scan.assert_no_checks_fail()

Refer to the soda-core repo in GitHub for details.
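
To keep the API keys out of source code in a programmatic scan, one option is to build the soda_cloud string from environment variables before passing it to scan.add_configuration_yaml_str. The following is a minimal sketch under stated assumptions: the helper name soda_cloud_yaml_from_env and the variable names SODA_API_KEY_ID and SODA_API_KEY_SECRET are hypothetical, chosen for illustration only.

```python
import os
from typing import Mapping, Optional


def soda_cloud_yaml_from_env(
    host: str = "cloud.soda.io",
    key_id_var: str = "SODA_API_KEY_ID",          # assumed variable name
    secret_var: str = "SODA_API_KEY_SECRET",      # assumed variable name
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a soda_cloud configuration string from environment variables.

    Hypothetical helper, not part of Soda Core. Raises KeyError if a
    variable is unset or empty, so a scan never runs with blank keys.
    """
    env = os.environ if env is None else env
    missing = [name for name in (key_id_var, secret_var) if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    # Values are wrapped in double quotes; keys that contain a double
    # quote themselves would need escaping, which this sketch omits.
    return (
        "soda_cloud:\n"
        f"  host: {host}\n"
        f'  api_key_id: "{env[key_id_var]}"\n'
        f'  api_key_secret: "{env[secret_var]}"\n'
    )
```

In the example above, the hard-coded string could then be replaced with scan.add_configuration_yaml_str(soda_cloud_yaml_from_env()), passing host="cloud.us.soda.io" for a US-region account.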

Provide credentials as system variables

If you wish, you can provide API key credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only as long as the terminal session in which you created them remains open. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile file.

  1. From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your API key ID. Repeat the command for each variable that the configuration file references, such as API_SECRET.
    export API_KEY=1234
    
  2. Test that the system retrieves the value that you set by running an echo command.
    echo $API_KEY
    
  3. In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
     data_source my_database_name:
       type: postgres
       connection:
         host: soda-temp-demo
         port: '5432'
         username: sodademo
         password: ${POSTGRES_PASSWORD}
         database: postgres
         schema: public
        
     soda_cloud:
       host: cloud.soda.io
       api_key_id: ${API_KEY}
       api_key_secret: ${API_SECRET}
    
  4. Save the configuration YAML file, then run a scan to confirm that Soda Core connects to Soda Cloud without issue.
    soda scan -d your_datasource -c configuration.yml checks.yml
    
  5. Navigate to your Soda Cloud account in your browser, then review the results of your latest scan in Check Results.
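
Because every ${...} reference in the configuration file needs a matching variable in the session that runs the scan, it can help to list the references that are still unset before you scan. The sketch below does that with a regular expression. The function name unset_variables is hypothetical, and the pattern assumes variable names made of letters, digits, and underscores.

```python
import os
import re
from typing import List, Mapping, Optional

# Matches references such as ${API_KEY}; the name pattern is an assumption.
_VAR_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unset_variables(config_text: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the sorted names referenced as ${NAME} that are not set in env.

    Hypothetical helper, not part of Soda Core. Defaults to the current
    process environment when env is not given.
    """
    env = os.environ if env is None else env
    referenced = sorted(set(_VAR_REF.findall(config_text)))
    return [name for name in referenced if name not in env]
```

For the example configuration in step 3, a session that has only exported API_KEY would get back API_SECRET and POSTGRES_PASSWORD, which are the two variables still to set.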

Go further


