Connect Soda Core to Soda Cloud
Last modified on 31-May-23
To use all the features and functionality that Soda has to offer, you can install and configure the Soda command-line tool, then connect it to your Soda account.
Soda Core uses an API to connect to Soda Cloud. To use the API, you must generate API keys in your Soda Cloud account, then add them to the configuration YAML file that Soda Core uses to connect to your data sources. Note that the API keys you create do not expire.
Prerequisites
Connect
Connect Soda Core for SparkDF to Soda Cloud
Provide credentials as system variables
Go further
Prerequisites
- You have installed and configured Soda Core and run at least one scan of your data.
Connect
- If you have not already done so, create a Soda Cloud account at https://cloud.soda.io/signup. Select a region for your account based on where you wish to store Soda Cloud data.
- Open your configuration.yml file in a text editor, then add the following to the file, leaving the values blank for now.
  - Be sure to add the syntax for soda_cloud at the root level of the YAML file, not nested under any other data_source syntax.
  - Consider creating system or environment variables for the values of your API key and secret; see Provide credentials as system variables.

soda_cloud:
  # For host, use cloud.soda.io for EU region, use cloud.us.soda.io for USA region
  host: cloud.soda.io
  api_key_id:
  api_key_secret:
  # Optional
  scheme:

- In your Soda Cloud account, navigate to your avatar > Profile, then navigate to the API Keys tab. Click the plus icon to generate new API keys.
  - Copy the API Key ID, then paste it into the configuration.yml as the value for api_key_id.
  - Copy the API Key Secret, then paste it into the configuration.yml as the value for api_key_secret.
  - Optionally, provide a value for the scheme property to indicate which scheme to use to initialize the URI instance. If you do not explicitly include a scheme property, Soda uses the default https.
- Save the changes to the configuration.yml file. Close the Create API Key dialog box in Soda Cloud.
- From the command-line, use Soda Core to scan the datasets in your data source again.
soda scan -d your_datasource_name -c configuration.yml checks.yml
- Navigate to your Soda Cloud account in your browser and review the results of your latest scan in Check Results.
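
The host choice and root-level placement described in the steps above can be sketched in Python. This is a minimal illustration and not part of Soda Core: the function name render_soda_cloud_block and the region labels "EU" and "US" are hypothetical. Only the two host names and the property names come from the steps above.

```python
# Hypothetical helper: renders the soda_cloud block for configuration.yml.
# The host names are the two listed in the steps above; the region labels are illustrative.
SODA_CLOUD_HOSTS = {
    "EU": "cloud.soda.io",
    "US": "cloud.us.soda.io",
}


def render_soda_cloud_block(region, api_key_id="", api_key_secret="", scheme=None):
    """Return YAML text for a root-level soda_cloud block."""
    if region not in SODA_CLOUD_HOSTS:
        raise ValueError(f"Unknown region: {region!r}")
    lines = [
        "soda_cloud:",  # no leading spaces, so the block sits at the root level
        f"  host: {SODA_CLOUD_HOSTS[region]}",
        f"  api_key_id: {api_key_id}",
        f"  api_key_secret: {api_key_secret}",
    ]
    if scheme is not None:
        # Optional property; when it is omitted, Soda uses the default https.
        lines.append(f"  scheme: {scheme}")
    # Strip trailing spaces left behind by blank values.
    return "\n".join(line.rstrip() for line in lines) + "\n"


if __name__ == "__main__":
    print(render_soda_cloud_block("US"))
```

Calling the function with blank key values produces the same skeleton as step 2, ready for the values you copy from Soda Cloud.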
Connect Soda Core for SparkDF to Soda Cloud
Unlike with other data sources, Soda Core does not require a configuration YAML file to run scans against Spark DataFrames. Soda Core for SparkDF is for use with programmatic Soda scans only.
Therefore, to connect to Soda Cloud, include the Soda Cloud API keys (see step 3, above) in your programmatic scan using either scan.add_configuration_yaml_file(file_path) or scan.add_configuration_yaml_str(config_string), as in the example below.
from pyspark.sql import SparkSession, types
from soda.scan import Scan

# Create a local Spark session and a small DataFrame to scan
spark_session = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark_session.createDataFrame(
    data=[{"id": "1", "name": "John Doe"}],
    schema=types.StructType(
        [types.StructField("id", types.StringType()), types.StructField("name", types.StringType())]
    ),
)
# Register the DataFrame as a temporary view so that checks can reference it by name
df.createOrReplaceTempView("users")

scan = Scan()
scan.set_verbose(True)
scan.set_scan_definition_name("YOUR_SCHEDULE_NAME")
scan.set_data_source_name("spark_df")
# Provide the Soda Cloud API keys via a configuration file or a YAML string
scan.add_configuration_yaml_file(file_path="sodacl_spark_df/configuration.yml")
scan.add_configuration_yaml_str(
    """
    soda_cloud:
      api_key_id: "[key]"
      api_key_secret: "[secret]"
      host: cloud.soda.io
    """
)
scan.add_spark_session(spark_session)
scan.add_sodacl_yaml_file(file_path="sodacl_spark_df/checks.yml")
# ... all other scan methods in the standard programmatic scan ...
scan.execute()
# print(scan.get_all_checks_text())
print(scan.get_logs_text())
# scan.assert_no_checks_fail()
Refer to the soda-core repo in GitHub for details.
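
To avoid hard-coding keys in a programmatic scan, one option is to build the soda_cloud string from environment variables before passing it to scan.add_configuration_yaml_str. The sketch below assumes the variable names API_KEY and API_SECRET; the helper soda_cloud_yaml_from_env is hypothetical and not part of Soda Core.

```python
import os


def soda_cloud_yaml_from_env(env=None, host="cloud.soda.io"):
    """Build a soda_cloud YAML string from environment variables.

    Assumes API_KEY and API_SECRET hold the Soda Cloud API key ID and secret.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("API_KEY", "API_SECRET") if not env.get(name)]
    if missing:
        # Fail early rather than sending blank credentials to Soda Cloud.
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return (
        "soda_cloud:\n"
        f'  api_key_id: "{env["API_KEY"]}"\n'
        f'  api_key_secret: "{env["API_SECRET"]}"\n'
        f"  host: {host}\n"
    )


# Usage inside a programmatic scan, where scan is a soda.scan.Scan object:
# scan.add_configuration_yaml_str(soda_cloud_yaml_from_env())
```

Passing a plain dictionary as env makes the helper easy to exercise without touching the real environment.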
Provide credentials as system variables
If you wish, you can provide API key credentials or any of the properties in the configuration YAML file as system variables instead of storing the values directly in the file. System variables persist only for as long as you have the terminal session open in which you created the variable. For a longer-term solution, consider using permanent environment variables stored in your ~/.bash_profile or ~/.zprofile files.
- From your command-line interface, set a system variable to store the value of a property that the configuration YAML file uses. For example, you can use the following command to define a system variable for your API key.
export API_KEY=1234
- Test that the system retrieves the value that you set by running an echo command.
echo $API_KEY
- In the configuration YAML file, set the value of the property to reference the environment variable, as in the following example.
data_source my_database_name:
  type: postgres
  connection:
    host: soda-temp-demo
    port: '5432'
    username: sodademo
    password: ${POSTGRES_PASSWORD}
    database: postgres
  schema: public

soda_cloud:
  host: cloud.soda.io
  api_key_id: ${API_KEY}
  api_key_secret: ${API_SECRET}
- Save the configuration YAML file, then run a scan to confirm that Soda Core connects to Soda Cloud without issue.
soda scan -d your_datasource -c configuration.yml checks.yml
- Navigate to your Soda Cloud account in your browser and review the results of your latest scan in Check Results.
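
Before running the scan, it can help to confirm that every variable referenced in the configuration file is actually set in the current session. The following is a minimal sketch that assumes references use the ${NAME} form shown above. The function find_unset_variables is hypothetical, and the pattern matching is an illustration rather than a description of how Soda Core resolves variables.

```python
import os
import re

# Matches references of the form ${NAME}, as used in the configuration example.
VARIABLE_REFERENCE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_unset_variables(config_text, env=None):
    """Return the sorted names referenced in config_text that are not set in env."""
    env = os.environ if env is None else env
    referenced = set(VARIABLE_REFERENCE.findall(config_text))
    return sorted(name for name in referenced if name not in env)


# Example with inline text and an explicit environment mapping:
example = "api_key_id: ${API_KEY}\napi_key_secret: ${API_SECRET}\n"
print(find_unset_variables(example, env={"API_KEY": "1234"}))  # prints ['API_SECRET']
```

To check a real file, read the contents of configuration.yml and pass them as config_text; with env left as None, the function checks against the variables set in the current session.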
Go further
- Learn more about using SodaCL to write checks for data quality.
- Learn more about viewing failed rows in Soda Cloud.
- Learn more about Soda Cloud architecture.
- Need help? Join the Soda community on Slack.