Use your command-line interface to connect Soda SQL to your database, prepare default tests that will surface “bad” data, then run your first scan in a few minutes.
In the context of Soda SQL, a warehouse is a type of data source that represents a SQL engine or database such as Snowflake, Amazon Redshift, or PostgreSQL. If you do not have access to a warehouse on your system, you can use Docker to build a sample PostgreSQL warehouse so that you can set up your Soda SQL CLI tool and see it in action.
All the instructions below reference this sample warehouse in the commands.
- From your command-line interface, execute the following to build a containerized PostgreSQL warehouse. Note that the
-voption connects to a location on your local drive in which Soda SQL will create your warehouse directory file further in this tutorial.
docker run --name soda_sql_tutorial_db --rm -d \ -p 5432:5432 \ -v soda_sql_tutorial_postgres:/var/lib/postgresql/data:rw \ -e POSTGRES_USER=sodasql \ -e POSTGRES_DB=sodasql \ -e POSTGRES_HOST_AUTH_METHOD=trust \ postgres:9.6.17-alpine
- Load sample data into your warehouse.
docker exec soda_sql_tutorial_db \ sh -c "wget -qO - https://raw.githubusercontent.com/sodadata/soda-sql/main/tests/demo/demodata.sql | psql -U sodasql -d sodasql"
The instructions below reference the sample warehouse in the commands. There are many install packages for Soda SQL that correspond to different warehouse types; this tutorial uses PostgreSQL. Customize the example commands to use your own PostgreSQL warehouse connection details, if you like.
- From your command-line interface, verify your installation of Soda SQL using the
$ soda Usage: soda [OPTIONS] COMMAND [ARGS]...
- Create, then navigate to a new Soda SQL warehouse directory. The example below creates a directory named
$ mkdir soda_sql_tutorial $ cd soda_sql_tutorial
- Use the
soda create postgrescommand to create and pre-populate two files that enable you to configure connection details for Soda SQL to access your warehouse:
warehouse.ymlfile which stores access details for your warehouse (read more)
env_vars.ymlfile which securely stores data source login credentials (read more)
$ soda create postgres -d sodasql -u sodasql -w soda_sql_tutorial
| Soda CLI version ... | Creating warehouse YAML file warehouse.yml ... | Creating /Users/tom/.soda/env_vars.yml with example env vars in section soda_sql_tutorial | Review warehouse.yml by running command | cat warehouse.yml | Review section soda_sql_tutorial in ~/.soda/env_vars.yml by running command | cat ~/.soda/env_vars.yml | Then run the soda analyze command
- Optionally, use the following commands to review the contents of the two YAML files you created. You do not need to adjust any contents as Soda SQL has already configured the warehouse connection details.
- Use the
soda analyzecommand to get Soda SQL to sift through the contents of your warehouse and automatically prepare a scan YAML file for each table. Soda SQL puts the YAML files in a new
/tablesdirectory in the warehouse directory you created. Read more about scan YAML files.
| Analyzing warehouse.yml ... | Querying warehouse for tables | Creating tables directory tables | Executing SQL query: SELECT table_name FROM information_schema.tables WHERE lower(table_schema)='public' | SQL took 0:00:00.007998 | Creating tables/demodata.yml ... | Executing SQL query: ... | SQL took 0:00:00.000647 | Next run 'soda scan warehouse.yml tables/demodata.yml' to calculate measurements and run tests
- Use the following command to review the contents of the new scan YAML file that Soda SQL created and named
table_name: demodata metrics: - row_count - missing_count - missing_percentage - values_count - values_percentage - invalid_count - invalid_percentage - min_length - max_length - avg_length - min - max - avg - sum - variance - stddev tests: - row_count > 0 columns: id: valid_format: uuid tests: - invalid_percentage == 0 feepct: valid_format: number_percentage tests: - invalide_percentage == 0
Note the three tests that Soda SQL configured in
demodata.yml. When it created this file, Soda SQL pre-populated it with the
metric configurations it deemed useful based on the data in the table it analyzed. Read more about the Anatomy of the scan YAML file.
If, when you run
soda analyze you get an an authentication error, check to see if you have another instance of postgres already running on port 5432. If so, try stopping or uninstalling the postgres instance, then run
soda analyze again.
- Use the
soda scancommand to run tests against the data in the demodata warehouse. As input, the command requires the name of the warehouse to scan, and the filepath and name of the table in the warehouse.
soda scan warehouse.yml tables/demodata.yml
- Examine the output of the command, in particular the lines at the bottom that indicate the results of the tests Soda SQL ran against your data. In this example, all the tests passed which indicates that there are no issues with the data.
| Soda CLI version ... | Scanning demodata in ./soda_sql_tutorial ... | Environment variable POSTGRES_PASSWORD is not set | Executing SQL query: SELECT column_name, data_type, is_nullable FROM information_schema.columns WHERE lower(table_name) = 'demodata' AND table_catalog = 'sodasql' AND table_schema = 'public' | SQL took 0:00:00.029199 | 6 columns: | id character varying | name character varying | size integer | date date | feepct character varying | country character varying | Query measurement: schema = id character varying, name character varying, size integer, date date, feepct character varying, country character varying | Executing SQL query: SELECT COUNT(*), COUNT(id), MIN(LENGTH(id)), MAX(LENGTH(id)), COUNT(name), MIN(LENGTH(name)), MAX(LENGTH(name)), COUNT(size), ... | missing_count(country) = 0 | values_percentage(country) = 100.0 | All good. 38 measurements computed. No tests failed.
- If you used Docker to create a sample PostgreSQL warehouse for this tutorial, be sure to execute the following commands to stop the container.
$ docker stop soda_sql_tutorial_db $ docker volume rm soda_sql_tutorial_postgres
- Consult Configure Soda SQL for details on setting up a non-PostgreSQL version of Soda SQL.
- Learn more about How Soda SQL works.
- Learn more about the scan YAML file and how to run scans.
- Need help? Join the Soda community on Slack.
Last modified on 16-Jul-21