A scan is a Soda SQL CLI command that uses SQL queries to extract information about data in a dataset.
Instead of laboriously accessing your data source and then manually defining SQL queries to analyze the data in datasets, you can use a much simpler Soda SQL scan. First, you configure scan metrics and tests in a scan YAML file, then Soda SQL uses the input from that file to prepare, then run SQL queries against your data.
You need to create a scan YAML file for every dataset in your data source that you want to scan. If you have 20 datasets in your data source, you need 20 YAML files, each corresponding to a single dataset.
You can create scan YAML files yourself, but the CLI command
soda analyze sifts through the contents of your data source and automatically prepares a scan YAML file for each dataset. Soda SQL puts the YAML files in a
/tables directory in your warehouse directory. (If you have not already created a warehouse YAML file, refer to the instructions in Warehouse YAML.)
In your command-line interface, navigate to the directory that contains your
warehouse.yml file, then execute the following:
$ soda analyze
| Analyzing warehouse.yml ... | Querying warehouse for tables | Creating tables directory tables | Executing SQL query: SELECT table_name FROM information_schema.tables WHERE lower(table_schema)='public' | SQL took 0:00:00.008511 | Executing SQL query: SELECT column_name, data_type, is_nullable FROM information_schema.columns WHERE lower(table_name) = 'demodata' AND table_catalog = 'sodasql' AND table_schema = 'public' | SQL took 0:00:00.013018 | Executing SQL query: ... | SQL took 0:00:00.008593 | Creating tables/demodata.yml ... | Next run 'soda scan warehouse.yml tables/demodata.yml' to calculate measurements and run tests
In the above example, Soda SQL created a scan YAML file named
demodata.yml and put it in the
If you decide to create your own scan YAML files manually, best practice dictates that you name the YAML file using the same name as the dataset in your warehouse.
Tip: Use the
soda analyze --help command to review options you can include to customize the analysis. For example, use
soda analyze --include customer to analyze only the dataset named
customer in your data source.
When it creates your scan YAML file, Soda SQL pre-populates it with the
metric configurations it deems useful based on the data in the dataset it analyzed. You can keep those configurations intact and use them to run your scans, or you can adjust or add to them to fine-tune the tests Soda SQL runs on your data.
The following describes the contents of a scan YAML file that Soda SQL created and pre-populated.
1 - The value of table_name identifies a dataset in your data source. If you were writing a SQL query, it would be the value you would supply for your
2 - A metric is a property of the data in your data source. A measurement is the value for a metric that Soda SQL checks against during a scan. For example, in the test
row_count = 5,
row_count is the metric and
5 is the measurement.
3 - A test is a Python expression that, during a scan, checks metrics to see if they match the parameters defined for a measurement. As a result of a scan, a test either passes or fails.
For example, the test
row_count > 0 checks to see if the dataset has at least one row. If the test passes, it means the dataset has at least one row; if the test fails, it means the dataset has no rows, which means that it is empty. Tests in this part of the YAML file apply to all columns in the dataset. A single Soda SQL scan can run many tests on the contents of the whole dataset.
4 - A column identifies a specific column in your dataset. Use column names to configure tests against individual columns in the dataset. A single Soda SQL scan can run many tests in many columns.
feepct are column names that identify specific columns in the dataset this scan YAML file scans.
6 - The value of the column configuration key
valid_format identifies the only form of data in the column that Soda SQL recognizes as valid during a scan. In this case, any row in the
id column that contains data that is in UUID format (universally unique identifier) is valid; anything else is invalid.
7 - Same as above, except the tests in the
column section of the YAML file run only against the contents of the single, identified column. In this case, the test
invalid_percentage == 0 checks to see if all rows in the
id column contain data in a valid format. If the test passes, it means that 0% of the rows contain data that is invalid; if the test fails, it means that more than 0% of the rows contain invalid data, which is data that is in non-UUID format.
The table below describes all of the top level configuration keys you can use to customize your scan.
| ||The section of the scan YAML file in which you define tests and metrics that apply to individual columns. See Column metrics for configuration details.||optional|
| ||Identifies the columns against which Soda SQL does NOT execute tests during a scan. Identifies columns by name. Use |
On a related note, you can specify datasets to exclude or include during
| ||A SQL expression that Soda SQL adds to the ||optional|
| ||Defines the maximum number of elements for the ||optional|
| ||A list of all the built-in metrics that you can use to configure a scan. This list includes both dataset and column metrics. See Configure metrics in Soda SQL for configuration details.||optional|
| ||Defines the maximum number of elements for the ||optional|
| ||Defines a threshold on the number of sample rows that Soda SQL sends to Soda Cloud. See Send sample data to Soda Cloud and Send failed rows to Soda Cloud.||optional|
| ||The section of the scan YAML file in which you define custom sql queries to run during a scan. You can apply ||optional|
| ||Identifies a dataset in your data source.||required|
table_name: orders metrics: - row_count - missing_count - missing_percentage - ... excluded_columns: - discount - productid tests: - row_count > 0 columns: orderid: valid_format: uuid tests: - invalid_percentage <= 3
- Next, run a scan on the data in your warehouse.
- Learn how to specify the datasets you wish to include or exclude during
- Learn more about the warehouse YAML file.
- Learn how to configure metrics in your YAML files.
- Learn more about configuring tests.
- Reference the Data types that Soda SQL supports when it scans columns.
- Need help? Join the Soda community on Slack.
Last modified on 15-Sep-21