Send sample data to Soda Cloud
When creating new monitors in Soda Cloud, you may find it useful to review sample data from your dataset to help you determine the kinds of tests to run when Soda SQL scans your data. For this reason, you may wish to configure a samples configuration key in Soda SQL.
Add a sample configuration key
DO NOT use sample data if your dataset contains sensitive information or personally identifiable information (PII). For security, you can disable the sample data feature entirely, or reroute sample data to an alternate, secure location.
- If you have not already done so, connect Soda SQL to your Soda Cloud account.
- Add a samples configuration key to your scan YAML file according to the Scan YAML example below. Use table_limit to define the maximum number of rows in a dataset that Soda SQL sends to Soda Cloud after it executes a test during a scan. This yields a sample of the data from your dataset in the Sample Data tab when you are creating a new monitor. A sample contains the first n rows from the dataset, according to the limit you specify.
- Save the changes to your scan YAML file, then run a scan on that dataset.
soda scan warehouse.yml tables/orders.yml
- In your Soda Cloud account, navigate to the Monitors dashboard. Click the stacked-dots icon to Create Monitor. Note that in the first step of the guided monitor creation, you can review the sample data that Soda SQL collected during its most recent scan of your dataset.
Scan YAML Example
table_name: orders
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  ...
samples:
  table_limit: 50
tests:
  - row_count > 0
columns:
  orderid:
    valid_format: uuid
    tests:
      - invalid_percentage <= 3
Using the example scan YAML above, the scan executes both tests against all the data in the dataset, but it sends a maximum of 50 rows of data and metadata to Soda Cloud for review as sample data when you create a new monitor for the orders dataset.
The snippet below displays the CLI output of the query that counts the rows in the dataset; Soda SQL counts 193 rows but only sends 50 as a sample to Soda Cloud.
| ...
| Executing SQL query:
SELECT *
FROM "public"."orders"
LIMIT 50;
| SQL took 0:00:00.074957
| Sent sample orders.sample (50/193) to Soda Cloud
| ...
Disable sample data
Where your datasets contain sensitive or private information, you may not want to send sample data from your data source to Soda Cloud. In such a circumstance, you can disable the feature completely in Soda Cloud.
To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:
- As an Admin, log in to your Soda Cloud account and navigate to your avatar > Organization Settings.
- In the Company tab, check the box to “Disable storage of sample data and failed row samples in Soda Cloud”, then Save.
Alternatively, you can prevent Soda SQL from sending metadata or samples to Soda Cloud by using one of the following methods:
- To prevent Soda SQL from sending an individual dataset’s scan results or samples to Soda Cloud, use the --offline option when you run a scan; see the first example after this list.
- To prevent Soda SQL from sending specific column scan results or samples, configure an excluded_columns configuration key in your scan YAML file; see the second example after this list.
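For example, the following command runs a scan without sending results or samples to Soda Cloud, and the scan YAML snippet beneath it excludes two columns from scan results and samples; the file paths and column names are illustrative.
soda scan --offline warehouse.yml tables/orders.yml
table_name: orders
excluded_columns:
  - email
  - last_name
...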
Reroute sample data for a dataset
Use a SampleProcessor to programmatically send a dataset’s samples to a secure location within your organization’s infrastructure, such as an Amazon S3 bucket or Google BigQuery. Note that you can only configure sample data rerouting for individual datasets, and only for scans that you have scheduled programmatically. In Soda Cloud, users looking for sample data see the message you define advising them where they can access and review sample data for the dataset.
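The warehouse-specific examples below all follow the same pattern: subclass SampleProcessor, override its process method to receive the scan context and store the sample rows, then return a dict whose message value is what Soda Cloud displays in place of the samples. The minimal sketch below illustrates that pattern; the import path and the contents of the context object (here, a rows entry) are assumptions carried through the examples that follow.
from soda.scan import SampleProcessor

class MySampleProcessor(SampleProcessor):
    # Override the process function; the returned 'message' is what
    # Soda Cloud shows users in place of the sample rows
    def process(self, context) -> dict:
        rows = context.get('rows')  # assumed: sample rows collected by the scan
        # ... write the rows to a secure location in your infrastructure ...
        return {'message': 'Sample data is stored in <your secure location>'}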
Reroute to Amazon S3
First, configure a SampleProcessor according to the following example.
import json
import logging
import os

import boto3
from botocore.exceptions import ClientError
from soda.scan import SampleProcessor

class S3SampleProcessor(SampleProcessor):
    # Override the process function
    def process(self, context) -> dict:
        bucket_name = 'my-sample-bucket'  # replace with your S3 bucket
        file_name = 'sample_rows.json'
        # Write the sample rows from the scan context to a local JSON file
        with open(file_name, 'w', encoding='utf-8') as f:
            json.dump(context.get('rows'), f)  # assumed: context carries the sample rows
        object_name = os.path.basename(file_name)
        s3_client = boto3.client('s3')
        try:
            s3_client.upload_file(file_name, bucket_name, object_name)
        except ClientError as e:
            logging.error(e)
            return {'message': 'Unable to load sample data into S3'}
        return {'message':
                f'Sample data is stored in S3://{bucket_name}/{file_name}'}
Then, configure the sample processor in a scan builder as per the example below.
# scan_builder construction
scan_builder.sample_processor = S3SampleProcessor()
scan_result = scan_builder.build().execute()
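For context, a complete programmatic scan typically points the builder at the warehouse and scan YAML files before attaching the processor. The sketch below assumes the warehouse_yml_file and scan_yml_file attributes of Soda SQL’s programmatic scan interface; the file paths are illustrative.
# a sketch of the surrounding programmatic scan
scan_builder = ScanBuilder()
scan_builder.warehouse_yml_file = 'warehouse.yml'
scan_builder.scan_yml_file = 'tables/orders.yml'
scan_builder.sample_processor = S3SampleProcessor()
scan_result = scan_builder.build().execute()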
Reroute to Google BigQuery using existing credentials
This configuration uses the BigQuery access credentials that Soda SQL uses. These credentials must be associated with a service account and scopes in BigQuery that give Soda SQL write permission on the table.
First, configure a SampleProcessor according to the following example. Note that the client object points to different objects for different warehouses.
from soda.scan import SampleProcessor

class BigQuerySampleProcessor(SampleProcessor):
    # Override the process function; the connection in the scan context
    # is treated here simply as a BigQuery client
    def process(self, context) -> dict:
        table_id = "your-project.your_dataset.your_table"
        conn = context.get('connection')  # assumed: the BigQuery client
        rows = context.get('rows')        # assumed: sample rows collected by the scan
        errors = conn.insert_rows_json(
            table_id, rows, row_ids=[None] * len(rows)
        )  # Make an API request.
        if errors == []:
            return {'count': 50,  # illustrative values
                    'columns': ['id', 'amount'],
                    'message': f'Sample data is stored in {table_id}'}
        else:
            return {'message': 'Unable to save sample data to BigQuery'}
Then, configure the sample processor in a scan builder as per the example below.
# scan_builder construction
scan_builder.sample_processor = BigQuerySampleProcessor()
scan_result = scan_builder.build().execute()
Reroute to Google BigQuery using separate credentials
This configuration does not use the BigQuery access credentials that Soda SQL uses. The separate credentials must be associated with a service account and scopes in BigQuery that give Soda SQL write permission on the table.
First, configure a SampleProcessor according to the following example. Note that the client object points to different objects for different warehouses.
import json

from google.api_core.exceptions import GoogleAPICallError
from google.cloud import bigquery
from soda.scan import SampleProcessor

class BigQuerySampleProcessor(SampleProcessor):
    # Override the process function
    # context: sql, connection, sample_reference
    def process(self, context) -> dict:
        table_schema = {
            ## Define the schema for the sample dataset.table
        }
        project_id = '<my_project>'
        dataset_id = '<my_dataset>'
        table_id = '<my_table>'
        # Build a client with its own credentials, separate from Soda SQL's
        client = bigquery.Client(project=project_id)
        dataset = client.dataset(dataset_id)
        table = dataset.table(table_id)
        try:
            rows = context.get('rows')  # assumed: sample rows collected by the scan
            json_object = json.loads(rows)
            job_config = bigquery.LoadJobConfig()
            job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
            job_config.schema = format_schema(table_schema)  # your own schema-conversion helper
            job = client.load_table_from_json(json_object, table, job_config=job_config)
            job.result()  # Wait for the load job to complete
        except GoogleAPICallError:  # or TimeoutError or TypeError
            return {'message': 'Unable to save sample data to BigQuery'}
        return {'count': 42,  # illustrative values
                'columns': ['id', 'amount'],
                'message': f'Sample data is stored in {table_id}'}
Then, configure the sample processor in a scan builder as per the example below.
# scan_builder construction
scan_builder = ScanBuilder()
scan_builder.sample_processor = BigQuerySampleProcessor()
scan_result = scan_builder.build().execute()
Go further
- Read more about failed row samples in Soda Cloud.
- Sign up for a Soda Cloud account.
- Create monitors in Soda Cloud.
- Learn more about Soda Cloud architecture.
- Need help? Join the Soda community on Slack.