
Connect Soda Core to Soda Cloud

To use all the features and functionality that Soda Cloud and Soda Core have to offer, you can install and configure the Soda Core command-line tool, then connect it to your Soda Cloud account.

Soda Core uses an API to connect to Soda Cloud. To use the API, you must generate API keys in your Soda Cloud account, then add them to the configuration YAML file that Soda Core uses to connect to your data sources. Note that the API keys you create do not expire.

Prerequisites
Connect
Connect Soda Core for SparkDF to Soda Cloud
Go further

Prerequisites

  • You have installed Soda Core and configured it to connect to a data source.

Connect

  1. If you have not already done so, create a Soda Cloud account at cloud.soda.io.
  2. Open your configuration.yml file in a text editor, then add the following to the file. Be sure to add the soda_cloud syntax at the root level of the YAML file, not nested under any data_source syntax; a sketch of a complete file follows these steps.
    soda_cloud:
      host: cloud.soda.io
      api_key_id:
      api_key_secret:
    
  3. In your Soda Cloud account, navigate to your avatar > Profile > API Keys, then click the plus icon to generate new API keys.
    • Copy the API Key ID, then paste it into the configuration.yml as the value for api_key_id.
    • Copy the API Key Secret, then paste it into the configuration.yml as the value for api_key_secret.
  4. Save the changes to the configuration.yml file. Close the Create API Key dialog box in Soda Cloud.
  5. From the command line, use Soda Core to scan the datasets in your data source again.
    soda scan -d your_datasource_name -c configuration.yml checks.yml
    
  6. Navigate to your Soda Cloud account in your browser to review the results of your latest scan in Check Results.
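
For reference, the sketch below shows the layout of a complete configuration.yml, with the soda_cloud syntax at the root level alongside the data_source syntax. The data_source block is illustrative only: it assumes a PostgreSQL data source with placeholder connection values, and the exact connection keys vary by data source type.

data_source your_datasource_name:
  type: postgres
  host: localhost
  port: 5432
  username: your_username
  password: your_password
  database: your_database
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: your_api_key_id
  api_key_secret: your_api_key_secret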

Connect Soda Core for SparkDF to Soda Cloud

Unlike other data sources, Soda Core does not require a configuration YAML file to run scans against Spark DataFrames; this data source is for use with programmatic Soda scans only.

Therefore, to connect to Soda Cloud, include the Soda Cloud API keys (see step 3, above) in your programmatic scan using either scan.add_configuration_yaml_file(file_path) or scan.add_configuration_yaml_str(config_string), as in the example below.

from pyspark.sql import SparkSession, types
from soda.scan import Scan

spark_session = SparkSession.builder.master("local").appName("test").getOrCreate()
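# Build a sample DataFrame to scan; in practice, this is a DataFrame your pipeline produces.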
df = spark_session.createDataFrame(
    data=[{"id": "1", "name": "John Doe"}],
    schema=types.StructType(
        [types.StructField("id", types.StringType()), types.StructField("name", types.StringType())]
    ),
)
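# Register the DataFrame as a temp view so Soda can address it as the dataset "users".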
df.createOrReplaceTempView("users")

scan = Scan()
scan.set_verbose(True)
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
scan.set_data_source_name("spark_df")
scan.add_configuration_yaml_file(file_path="sodacl_spark_df/configuration.yml")
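# Either configuration method works; add_configuration_yaml_str is shown with placeholder API key values.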
scan.add_configuration_yaml_str(
    """
soda_cloud:
  api_key_id: "[key]"
  api_key_secret: "[secret]"
  host: cloud.soda.io
"""
)
scan.add_spark_session(spark_session)
scan.add_sodacl_yaml_file(file_path="sodacl_spark_df/checks.yml")
# ... all other scan methods in the standard programmatic scan ...
scan.execute()

# print(scan.get_all_checks_text())
print(scan.get_logs_text())
# scan.assert_no_checks_fail()
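
If you run this scan as part of a pipeline, you can gate downstream steps on the scan outcome. The sketch below is one illustrative pattern, not required usage: it reuses the scan object from the example above and replaces the final lines of that example.

exit_code = scan.execute()    # non-zero indicates scan errors or failed checks
print(scan.get_logs_text())
scan.assert_no_checks_fail()  # raises an exception if any check failed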

Refer to the soda-core repo on GitHub for details.

Go further



Last modified on 30-Sep-22