diff options

| | | |
|---|---|---|
| author | CoprDistGit <infra@openeuler.org> | 2023-05-31 05:53:23 +0000 |
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-31 05:53:23 +0000 |
| commit | 863316c39e6849bffc18f96045838f19101f07e2 (patch) | |
| tree | 548891afc85590ad4dcc52bf363e3af805ab7c01 | |
| parent | e27568c54925ef66d090a222fd507fb7afdc326d (diff) | |
automatic import of python-dtspec
| -rw-r--r-- | .gitignore | 1 |
| -rw-r--r-- | python-dtspec.spec | 1704 |
| -rw-r--r-- | sources | 1 |
3 files changed, 1706 insertions, 0 deletions
@@ -0,0 +1 @@
+/dtspec-0.7.5.tar.gz
diff --git a/python-dtspec.spec b/python-dtspec.spec
new file mode 100644
index 0000000..1953c92
--- /dev/null
+++ b/python-dtspec.spec
@@ -0,0 +1,1704 @@
+%global _empty_manifest_terminate_build 0
+Name: python-dtspec
+Version: 0.7.5
+Release: 1
+Summary: dtspec - Data Test Spec
+License: MIT
+URL: https://github.com/inside-track/dtspec
+Source0: https://mirrors.nju.edu.cn/pypi/web/packages/38/94/ebc9c6bf397744ec6bfad655cd2c298cbb4dd573ce98bd5a1150699a7d09/dtspec-0.7.5.tar.gz
+BuildArch: noarch
+
+
+%description
+version: '0.1'
+description: HelloWorld - Simplest example of running dtspec
+# The names of sources and targets are arbitrary, but it's up to the user to determine
+# how they get mapped to/from their data transformation system.
+sources:
+  - source: raw_students
+targets:
+  - target: salutations
+````
+These define our inputs and outputs. But we also need to define how to generate
+data for the input(s). For that, we define a **factory**:
+````yaml
+factories:
+  - factory: SomeStudents
+    description: Minimal example of what some student records may look like
+    data:
+      - source: raw_students
+        # Tables are written as markdown tables
+        table: |
+          | id | name |
+          | - | - |
+          | 1 | Buffy |
+          | 2 | Willow |
+````
+Lastly, we need to describe how we expect the data to look after it has been transformed.
+To do this, we define **scenarios** and **cases**. Scenarios are collections of cases
+that share some common data factory or describe similar situations.
For now, our
+transform spec will just contain a single scenario and a single case:
+````yaml
+scenarios:
+  - scenario: Hello World
+    description: The simplest scenario
+    # All cases in this scenario will use this factory (which may be modified on a case-by-case basis)
+    factory:
+      parents:
+        - SomeStudents
+    cases:
+      - case: HelloGang
+        description: Make sure we say hello to everyone
+        expected:
+          data:
+            - target: salutations
+              # The actual output may also contain the "name" field, but the expectation
+              # will ignore comparing any fields not listed in the expected table.
+              table: |
+                | id | salutation |
+                | - | - |
+                | 1 | Hello Buffy |
+                | 2 | Hello Willow |
+````
+That's it. See also the [full YAML spec](tests/hello_world.yml).
+Now that we've described the full transform spec, we need to use it. The first step is to
+parse the YAML file, send it to the dtspec api, and have dtspec generate source data:
+````python
+import dtspec
+import yaml
+spec = yaml.safe_load(open("tests/hello_world.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+````
+The specific steps taken at this point are going to be sensitive to the data transformation
+environment being used, but we'll stick with our Pandas transformations for the sake of this
+tutorial. Given this, we can define a simple function that converts the source data returned
+from dtspec into Pandas dataframes:
+````python
+import pandas as pd
+def parse_sources(sources):
+    "Converts test data returned from dtspec api into Pandas dataframes"
+    return {
+        source_name: pd.DataFrame.from_records(data.serialize())
+        for source_name, data in sources.items()
+    }
+````
+We can then run those test Pandas dataframes through our data transformation function.
+````python
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_transformer(**sources_data)
+````
+Next, we need to convert the output dataframes of the transformations, `actual_data`,
+back into a format that can be loaded into dtspec for comparison. For Pandas,
+this function is:
+````python
+import json
+def serialize_actuals(actuals):
+    "Converts Pandas dataframe results into form needed to load dtspec api actuals"
+    return {
+        target_name: json.loads(dataframe.astype(str).to_json(orient="records"))
+        for target_name, dataframe in actuals.items()
+    }
+````
+It is loaded into dtspec using:
+````python
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+````
+Finally, dtspec can be called to run all of the expectations:
+````python
+api.assert_expectations()
+````
+Putting all of this together:
+````python
+spec = yaml.safe_load(open("tests/hello_world.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+Try running the above code and changing either the YAML spec or the `hello_world_transformer`
+function and see how dtspec responds.
+### Hello World With Multiple Test Cases
+Running tests with multiple cases that reference the same data sources
+introduces a complicating factor. One of the things that makes
+it hard to build tests for ETL/ELT is the fact that many data
+transformation systems in use today have a high latency for even very
+small transformations. For example, Redshift is a distributed RDBMS
+that can process billions of rows in minutes, millions of rows in
+seconds, thousands of rows in seconds, or tens of rows in, well,
+seconds.
Given these latency issues, we don't want to have to rely on
+loading data into our system, running a test, clearing out the data,
+loading some more, running the next test, and so on, as is often
+done when testing ORM-based applications like Rails or Django.
+dtspec seeks to minimize the number of requests on the data
+transformation system in order to deal with these latency issues.
+It does this by "stacking" the test data generated in each case
+and delivering back to the user all of this stacked data. The user
+then loads this stacked data into their data transformation system
+**once**, runs the data transformations **once**, and then collects
+the resulting output **once**.
+Let's see how dtspec handles this in action.
+First, let's change our hello world data transformation a bit. Instead of
+just saying hello to our heroes, let's say goodbye to any villains (as
+identified by a `clique` data field).
+````python
+def hello_world_multiple_transformer(raw_students):
+    def salutation(row):
+        if row["clique"] == "Scooby Gang":
+            return "Hello {}".format(row["name"])
+        return "Goodbye {}".format(row["name"])
+    salutations_df = raw_students.copy()
+    salutations_df["salutation"] = salutations_df.apply(salutation, axis=1)
+    return {"salutations": salutations_df}
+````
+While it would be possible to test saying hello or goodbye in a single
+case just by adding more records to the source data, we'll split it
+into two to demonstrate how multiple cases work.
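Before turning to the YAML, it can help to sanity-check this transformer directly on a small hand-built dataframe. The rows below are illustrative only; they are not generated by dtspec:

````python
# Quick sanity check of hello_world_multiple_transformer with toy data.
# The dataframe below is hypothetical and not drawn from any dtspec factory.
import pandas as pd

def hello_world_multiple_transformer(raw_students):
    def salutation(row):
        if row["clique"] == "Scooby Gang":
            return "Hello {}".format(row["name"])
        return "Goodbye {}".format(row["name"])
    salutations_df = raw_students.copy()
    salutations_df["salutation"] = salutations_df.apply(salutation, axis=1)
    return {"salutations": salutations_df}

raw_students = pd.DataFrame(
    {"id": ["1", "2"], "name": ["Buffy", "Drusilla"], "clique": ["Scooby Gang", "Vampires"]}
)
result = hello_world_multiple_transformer(raw_students)
print(result["salutations"]["salutation"].tolist())
# → ['Hello Buffy', 'Goodbye Drusilla']
````

Anyone in the "Scooby Gang" clique gets a hello, everyone else a goodbye, which is exactly the split the two cases below exercise.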
Here's how the YAML would look:
+````yaml
+scenarios:
+  - scenario: Hello World With Multiple Cases
+    description: The simplest scenario
+    factory:
+      parents:
+        - SomeStudents
+    cases:
+      - case: HelloGang
+        description: Make sure we say hello to everyone
+        expected:
+          data:
+            - target: salutations
+              table: |
+                | id | name | clique | salutation |
+                | - | - | - | - |
+                | 1 | Buffy | Scooby Gang | Hello Buffy |
+                | 2 | Willow | Scooby Gang | Hello Willow |
+      - case: GoodbyeVillians
+        description: Say goodbye to villains
+        # For this case, we tweak the factory defined for the scenario.
+        factory:
+          # The ids here might be the same as above. However, these are just named
+          # references and get translated into unique ids when the source data
+          # is generated.
+          data:
+            - source: raw_students
+              table: |
+                | id | name |
+                | - | - |
+                | 1 | Drusilla |
+                | 2 | Harmony |
+              # Use values to populate a constant over all records
+              values:
+                - column: clique
+                  value: Vampires
+        expected:
+          data:
+            # Again, the ids here are not the actual ids sent to dtspec after performing
+            # the transformations. They are just named references and dtspec
+            # keeps track of the relationship between the actual ids and the named ones.
+            - target: salutations
+              table: |
+                | id | name | clique | salutation |
+                | - | - | - | - |
+                | 1 | Drusilla | Vampires | Goodbye Drusilla |
+                | 2 | Harmony | Vampires | Goodbye Harmony |
+````
+This won't quite work as is, because we're missing something. We have
+two cases that describe variations on the source data `raw_students`
+and the output `salutations`. dtspec collects the source data
+definitions from each case and stacks them into a single data source.
+The user then runs the transformations on that source and generates a
+single target to provide back to dtspec. But dtspec has to know which record
+belongs to which case.
To do this, we have to define an
+**identifier** that tells dtspec which columns should be used to identify
+a record as belonging to a case. A good identifier is often a primary
+key that uniquely defines a record, but it is not strictly required to
+be unique across all records.
+For this example, we'll define an identifier called "students" with a single
+**identifier attribute** called `id` that is a unique integer:
+````yaml
+identifiers:
+  - identifier: students
+    attributes:
+      - field: id
+        generator: unique_integer
+````
+We tell dtspec that this identifier is associated with the `id` columns of both
+the source and the target via:
+````yaml
+sources:
+  - source: raw_students
+    identifier_map:
+      - column: id
+        identifier:
+          name: students
+          attribute: id
+targets:
+  - target: salutations
+    identifier_map:
+      - column: id
+        identifier:
+          name: students
+          attribute: id
+````
+With sources and targets associated with identifiers, the values we see in
+the source factories and target expectations are not the values that
+are actually used in the data. Instead, they are simply **named
+references**. For example, in the "HelloGang" case, `id=1` belongs to
+Buffy and `id=2` belongs to Willow. But when dtspec generates the source
+data, the actual values may be 3 and 9, or 4 and 7, or something else.
+Unique values are not generated in any deterministic manner -- each
+run of dtspec can give a different set. dtspec only guarantees that
+each named reference will be a unique integer (via the `generator`
+defined in the `identifier` section).
+Furthermore, in the second case, called "GoodbyeVillians", we see that
+`id=1` belongs to Drusilla and `id=2` belongs to Harmony. dtspec will
+generate unique values for this case as well, and they **will not**
+conflict with the values generated for the first case.
So dtspec will pass
+back to the user 4 total records (Buffy, Willow, Drusilla, Harmony) with 4
+different ids.
+With the [full YAML spec](tests/hello_world_multiple_cases.yml) defined, we can
+run the assertions in the same fashion as in the earlier example:
+````python
+spec = yaml.safe_load(open("tests/hello_world_multiple_cases.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_multiple_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+#### Embedded Identifiers
+It is also possible to embed identifiers in the value of a particular column.
+For example, suppose our `salutation` column said hello to the `id` instead
+of the name of the person. To make this work, we have to put a particular
+string pattern in the column that indicates the name of the identifier, the
+attribute, and the named id - `{identifier.attribute[named_id]}`. The
+YAML spec would look like:
+````yaml
+  - case: HelloGang
+    description: Make sure we say hello to everyone
+    expected:
+      data:
+        - target: salutations
+          table: |
+            | id | name | clique | salutation |
+            | - | - | - | - |
+            | 1 | Buffy | Scooby Gang | Hello {students.id[1]} |
+            | 2 | Willow | Scooby Gang | Hello {students.id[2]} |
+````
+The [realistic example](tests/realistic.yml) discussed below has another example
+of using embedded identifiers.
+**Note** that embedded identifiers cannot be used to associate records
+with cases. A target must have at least one column listed in the
+`identifier_map` section.
+### A More Realistic Example
+Finally, let's examine a more realistic example that one might
+encounter when building a data warehouse. In these situations, we'll
+have multiple sources, targets, scenarios, and cases. Now suppose we
+have a students table, where every student belongs to a school and
+takes 0 to many classes.
Our goal is to create one denormalized table
+that combines all of these data sources into one table. Additionally,
+we want to create a table that aggregates all of our students to give
+a count of the students per school. In Pandas, the data transformation
+might look like:
+````python
+def realistic_transformer(raw_students, raw_schools, raw_classes, dim_date):
+    student_schools = raw_students.rename(
+        columns={"id": "student_id", "external_id": "card_id"}
+    ).merge(
+        raw_schools.rename(columns={"id": "school_id", "name": "school_name"}),
+        how="inner",
+        on="school_id",
+    )
+    student_classes = student_schools.merge(
+        raw_classes.rename(columns={"name": "class_name"}),
+        how="inner",
+        on="student_id",
+    ).merge(
+        dim_date.rename(columns={"date": "start_date"}), how="left", on="start_date"
+    )
+    student_classes["student_class_id"] = student_classes.apply(
+        lambda row: "-".join([str(row["card_id"]), str(row["class_name"])]), axis=1
+    )
+    students_per_school = (
+        student_schools.groupby(["school_name"])
+        .size()
+        .to_frame(name="number_of_students")
+        .reset_index()
+    )
+    return {
+        "student_classes": student_classes,
+        "students_per_school": students_per_school,
+    }
+````
+Given the [full YAML spec](tests/realistic.yml) defined, we can again run
+the data assertions using a familiar pattern:
+````python
+spec = yaml.safe_load(open("tests/realistic.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = realistic_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+## dbt support
+dtspec also contains a CLI tool that can facilitate using it with [dbt](https://getdbt.com).
+The CLI tool helps you set up a test environment, run dbt in that environment, and
+execute the dbt tests. The CLI tool currently only works for Postgres and Snowflake dbt
+projects.
+See the [dbt-container-skeleton](https://github.com/gnilrets/dbt-container-skeleton) for a
+working example.
+### dtspec CLI Config
+All of the dtspec files should be placed in a subdirectory of your dbt project: `dbt/dtspec`.
+The first thing to set up for the dtspec CLI is the configuration file, which should
+be placed in `dtspec/config.yml`. The configuration file tells dtspec how to recreate
+the table schemas in a test environment, where to recreate the table schemas, and where
+to find the results of a dbt run. Here is an example:
+````yaml
+# A target environment is where the output of data transformations appear.
+# Typically, there will only be one target environment.
+target_environments:
+  # The target environment IS NOT your production environment. It needs to be a separate
+  # database where dbt will run against the test data that dtspec generates. The name
+  # of this environment needs to be the same as a target defined in dbt profiles.yml (in this case `dtspec`)
+  dtspec:
+    # Field names here follow the same conventions as dbt profiles.yml (https://docs.getdbt.com/dbt-cli/configure-your-profile)
+    type: postgres
+    host: "{{ env_var('POSTGRES_HOST') }}"
+    port: 5432
+    user: "{{ env_var('POSTGRES_USER') }}"
+    password: "{{ env_var('POSTGRES_PASSWORD') }}"
+    dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+# A source environment is where source data is located. It may be in the same database
+# as the target environment, or it may be different if the data warehouse supports it (e.g., Snowflake).
+# It is also possible to define several source environments if your source data is spread
+# across multiple databases.
+source_environments:
+  raw:
+    # Use `tables` to specify source tables that need to be present to run tests.
+    tables:
+      # `wh_raw` is the name of a namespace (aka schema) in the `raw` source environment
+      wh_raw:
+        # tables may be listed individually (or, use `wh_raw: '*'` to indicate all tables within the `wh_raw` namespace)
+        - raw_customers
+        - raw_orders
+        - raw_payments
+    # In order to run tests, we need to replicate the table schemas in the test environment.
+    # The schema section here contains credentials for a database where those tables are defined.
+    # This is likely a production database (in your warehouse), or a production replica.
+    # dtspec only uses this database to reflect the table schemas (via `dtspec db --fetch-schemas`).
+    schema:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}"
+    # The test section contains credentials for a database where test data will be created.
+    # Data in this database is destroyed and rebuilt for every run of dtspec and SHOULD NOT be
+    # the same as the schema credentials defined above.
+    test:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+  # Pretending snapshots are in a different database because Postgres doesn't support cross-db queries.
+  # This is how you would do it if snapshots were in a different database than other raw source data.
+  snapshots:
+    tables:
+      snapshots: '*'
+    schema:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}"
+    test:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+````
+### Test environment setup
+Once the configuration file has been defined, the next step is to fetch/reflect schemas for
+the source tables. From the `dbt` directory, run the following CLI command:
+    dtspec db --fetch-schemas
+This will query all of the databases defined in the `schema` section of the source
+environments defined in `dtspec/config.yml`, and create table schema files in `dtspec/schemas`.
+The files in this directory should be committed to source control and updated whenever
+your source data changes (insofar as it would affect the dtspec tests).
+Next, initialize the test databases defined in the `test` section of the source
+environments defined in `dtspec/config.yml` with the CLI command:
+    dtspec db --init-test-db
+This will create empty source tables in your test databases, ready to be loaded with test data.
+### Executing tests
+In order to use dtspec with dbt, spec files must make use of the `dbt_source` and `dbt_ref`
+Jinja functions. These are analogous to the dbt `source` and `ref` functions. dtspec
+will compile your dbt project and use the `dbt/target/manifest.json` file to resolve the names
+of sources and targets that you want to test.
For example, the SomeStudents factory
+would be written as follows if this were a dbt project:
+````yaml
+factories:
+  - factory: SomeStudents
+    data:
+      - source: {{ dbt_source('raw', 'raw_students') }}
+        table: |
+          | id | name |
+          | - | - |
+          | 1 | Buffy |
+          | 2 | Willow |
+````
+and an expectation would be:
+````yaml
+  cases:
+    - case: HelloGang
+      expected:
+        data:
+          - target: {{ dbt_ref('salutations') }}
+            table: |
+              | id | salutation |
+              | - | - |
+              | 1 | Hello Buffy |
+              | 2 | Hello Willow |
+````
+With these references set, dtspec tests can be executed via the CLI command:
+    dtspec test-dbt
+This command will do the following:
+1. It will first compile your dbt project. If your dbt code does not change between
+   dtspec tests, you may skip this step by passing the `--partial-parse` argument.
+2. The dtspec spec files are compiled into a single document and dbt references are resolved.
+   The compiled dtspec document is output to `dtspec/compiled_specs.yml`, which does not
+   need to be saved to source control.
+3. Source data is generated and loaded into the test databases.
+4. dbt is executed against the test database.
+5. The models that dbt built in the target test environment are extracted. These are the "actuals".
+6. The actuals are compared with the expected data as specified in the dtspec specs.
+The `test-dbt` command has several options that may be useful. See `dtspec test-dbt -h` for a full
+list, but here are some noteworthy options:
+- `--models` specifies the models that dbt should run, using standard dbt model selection syntax.
+- `--scenarios` is used to restrict the number of scenarios that are tested. The argument is a
+  regular expression that will match on the compiled scenario name. This can be used
+  in combination with the `--models` option to only run those tests and models that you're
+  concerned with.
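For instance, the two options can be combined to run a narrow slice of the test suite. This invocation is a sketch only: the model name and scenario pattern below are placeholder values, not part of the dtspec documentation:

````shell
# Hypothetical example: run only the `salutations` model and any scenario
# whose compiled name matches the regular expression 'Hello World.*'
dtspec test-dbt --models salutations --scenarios 'Hello World.*'
````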
+### Additional CLI notes
+#### Log level
+If you want to see more detailed logging information, set the `DTSPEC_LOG_LEVEL` environment
+variable (options are DEBUG, INFO, WARN, and ERROR). For example:
+    DTSPEC_LOG_LEVEL=INFO dtspec test-dbt
+#### Project location
+If you really don't want to put dtspec in the dbt project directory, you can override the
+default by setting `DTSPEC_ROOT` and `DBT_ROOT` environment variables that point
+to the root path of these projects.
+#### Special Values
+When dtspec is run via the CLI, it recognizes nulls and booleans in the spec files. To
+indicate these kinds of values in a dtspec spec, use `{NULL}`, `{True}`, and `{False}`.
+For example:
+````yaml
+  cases:
+    - case: HelloGang
+      expected:
+        data:
+          - target: {{ dbt_ref('salutations') }}
+            table: |
+              | id | salutation | is_witch |
+              | - | - | - |
+              | 1 | Hello Buffy | {False} |
+              | 2 | Hello Willow | {True} |
+              | 3 | Hello NA | {NULL} |
+````
+#### Jinja context
+When writing spec files that will be parsed with the dtspec CLI, the following functions
+are available in the jinja context:
+* `datetime` -- This is the [Python datetime.datetime type](https://docs.python.org/3/library/datetime.html)
+* `date` -- This is the [Python datetime.date type](https://docs.python.org/3/library/datetime.html)
+* `relativedelta` -- This is the [Python relativedelta type](https://dateutil.readthedocs.io/en/stable/relativedelta.html)
+* `UTCNOW` -- The UTC datetime value at the time the specs are parsed
+* `TODAY` -- The current UTC date value at the time the specs are parsed
+* `YESTERDAY` -- Yesterday's date
+* `TOMORROW` -- Tomorrow's date
+* `dbt_source` -- Used to reference dbt sources
+* `dbt_ref` -- Used to reference dbt models
+Some examples of using these functions:
+      - source: raw_products
+        table: |
+          | export_time | file | product_id | product_name |
+          | - | - | - | - |
+          | {{ YESTERDAY }} | products-2021-01-06.csv | milk | Milk |
+          | {{ TODAY - relativedelta(days=5) }} |
products-2021-01-02.csv | milk | Milk |
+## Additional notes about dtspec
+* At the moment, all source data values are generated as strings. It
+  is up to the user to enforce data types suitable to their data
+  transformation system. Note that the dtspec dbt CLI commands handle this
+  for Postgres and Snowflake warehouses.
+* Additionally, data expectations are stringified prior to running assertions.
+## Contributing
+We welcome contributors! Please submit any suggestions or pull requests in Github.
+### Developer setup
+Create an appropriate python environment. I like [miniconda](https://conda.io/miniconda.html),
+but use whatever you like:
+    conda create --name dtspec python=3.8
+    conda activate dtspec
+Then install pip packages
+    pip install pip-tools
+    pip install --ignore-installed -r requirements.txt
+run tests via
+    inv test
+and the linter via
+    inv lint
+
+%package -n python3-dtspec
+Summary: dtspec - Data Test Spec
+Provides: python-dtspec
+BuildRequires: python3-devel
+BuildRequires: python3-setuptools
+BuildRequires: python3-pip
+%description -n python3-dtspec
+version: '0.1'
+description: HelloWorld - Simplest example of running dtspec
+# The names of sources and targets are arbitrary, but it's up to the user to determine
+# how they get mapped to/from their data transformation system.
+sources:
+  - source: raw_students
+targets:
+  - target: salutations
+````
+These define our inputs and outputs. But we also need to define how to generate
+data for the input(s). For that, we define a **factory**:
+````yaml
+factories:
+  - factory: SomeStudents
+    description: Minimal example of what some student records may look like
+    data:
+      - source: raw_students
+        # Tables are written as markdown tables
+        table: |
+          | id | name |
+          | - | - |
+          | 1 | Buffy |
+          | 2 | Willow |
+````
+Lastly, we need to describe how we expect the data to look after it has been transformed.
+To do this, we define **scenarios** and **cases**.
Scenarios are collections of cases
+that share some common data factory or describe similar situations. For now, our
+transform spec will just contain a single scenario and a single case:
+````yaml
+scenarios:
+  - scenario: Hello World
+    description: The simplest scenario
+    # All cases in this scenario will use this factory (which may be modified on a case-by-case basis)
+    factory:
+      parents:
+        - SomeStudents
+    cases:
+      - case: HelloGang
+        description: Make sure we say hello to everyone
+        expected:
+          data:
+            - target: salutations
+              # The actual output may also contain the "name" field, but the expectation
+              # will ignore comparing any fields not listed in the expected table.
+              table: |
+                | id | salutation |
+                | - | - |
+                | 1 | Hello Buffy |
+                | 2 | Hello Willow |
+````
+That's it. See also the [full YAML spec](tests/hello_world.yml).
+Now that we've described the full transform spec, we need to use it. The first step is to
+parse the YAML file, send it to the dtspec api, and have dtspec generate source data:
+````python
+import dtspec
+import yaml
+spec = yaml.safe_load(open("tests/hello_world.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+````
+The specific steps taken at this point are going to be sensitive to the data transformation
+environment being used, but we'll stick with our Pandas transformations for the sake of this
+tutorial. Given this, we can define a simple function that converts the source data returned
+from dtspec into Pandas dataframes:
+````python
+import pandas as pd
+def parse_sources(sources):
+    "Converts test data returned from dtspec api into Pandas dataframes"
+    return {
+        source_name: pd.DataFrame.from_records(data.serialize())
+        for source_name, data in sources.items()
+    }
+````
+We can then run those test Pandas dataframes through our data transformation function.
+````python
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_transformer(**sources_data)
+````
+Next, we need to convert the output dataframes of the transformations, `actual_data`,
+back into a format that can be loaded into dtspec for comparison. For Pandas,
+this function is:
+````python
+import json
+def serialize_actuals(actuals):
+    "Converts Pandas dataframe results into form needed to load dtspec api actuals"
+    return {
+        target_name: json.loads(dataframe.astype(str).to_json(orient="records"))
+        for target_name, dataframe in actuals.items()
+    }
+````
+It is loaded into dtspec using:
+````python
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+````
+Finally, dtspec can be called to run all of the expectations:
+````python
+api.assert_expectations()
+````
+Putting all of this together:
+````python
+spec = yaml.safe_load(open("tests/hello_world.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+Try running the above code and changing either the YAML spec or the `hello_world_transformer`
+function and see how dtspec responds.
+### Hello World With Multiple Test Cases
+Running tests with multiple cases that reference the same data sources
+introduces a complicating factor. One of the things that makes
+it hard to build tests for ETL/ELT is the fact that many data
+transformation systems in use today have a high latency for even very
+small transformations. For example, Redshift is a distributed RDBMS
+that can process billions of rows in minutes, millions of rows in
+seconds, thousands of rows in seconds, or tens of rows in, well,
+seconds.
Given these latency issues, we don't want to have to rely on
+loading data into our system, running a test, clearing out the data,
+loading some more, running the next test, and so on, as is often
+done when testing ORM-based applications like Rails or Django.
+dtspec seeks to minimize the number of requests on the data
+transformation system in order to deal with these latency issues.
+It does this by "stacking" the test data generated in each case
+and delivering back to the user all of this stacked data. The user
+then loads this stacked data into their data transformation system
+**once**, runs the data transformations **once**, and then collects
+the resulting output **once**.
+Let's see how dtspec handles this in action.
+First, let's change our hello world data transformation a bit. Instead of
+just saying hello to our heroes, let's say goodbye to any villains (as
+identified by a `clique` data field).
+````python
+def hello_world_multiple_transformer(raw_students):
+    def salutation(row):
+        if row["clique"] == "Scooby Gang":
+            return "Hello {}".format(row["name"])
+        return "Goodbye {}".format(row["name"])
+    salutations_df = raw_students.copy()
+    salutations_df["salutation"] = salutations_df.apply(salutation, axis=1)
+    return {"salutations": salutations_df}
+````
+While it would be possible to test saying hello or goodbye in a single
+case just by adding more records to the source data, we'll split it
+into two to demonstrate how multiple cases work.
Here's how the YAML would look:
+````yaml
+scenarios:
+  - scenario: Hello World With Multiple Cases
+    description: The simplest scenario
+    factory:
+      parents:
+        - SomeStudents
+    cases:
+      - case: HelloGang
+        description: Make sure we say hello to everyone
+        expected:
+          data:
+            - target: salutations
+              table: |
+                | id | name | clique | salutation |
+                | - | - | - | - |
+                | 1 | Buffy | Scooby Gang | Hello Buffy |
+                | 2 | Willow | Scooby Gang | Hello Willow |
+      - case: GoodbyeVillians
+        description: Say goodbye to villains
+        # For this case, we tweak the factory defined for the scenario.
+        factory:
+          # The ids here might be the same as above. However, these are just named
+          # references and get translated into unique ids when the source data
+          # is generated.
+          data:
+            - source: raw_students
+              table: |
+                | id | name |
+                | - | - |
+                | 1 | Drusilla |
+                | 2 | Harmony |
+              # Use values to populate a constant over all records
+              values:
+                - column: clique
+                  value: Vampires
+        expected:
+          data:
+            # Again, the ids here are not the actual ids sent to dtspec after performing
+            # the transformations. They are just named references and dtspec
+            # keeps track of the relationship between the actual ids and the named ones.
+            - target: salutations
+              table: |
+                | id | name | clique | salutation |
+                | - | - | - | - |
+                | 1 | Drusilla | Vampires | Goodbye Drusilla |
+                | 2 | Harmony | Vampires | Goodbye Harmony |
+````
+This won't quite work as is, because we're missing something. We have
+two cases that describe variations on the source data `raw_students`
+and the output `salutations`. dtspec collects the source data
+definitions from each case and stacks them into a single data source.
+The user then runs the transformations on that source and generates a
+single target to provide back to dtspec. But dtspec has to know which record
+belongs to which case.
To do this, we have to define an
+**identifier** that tells dtspec which columns should be used to identify
+a record as belonging to a case. A good identifier is often a primary
+key that uniquely defines a record, but it is not strictly required to
+be unique across all records.
+For this example, we'll define an identifier called "students" with a single
+**identifier attribute** called `id` that is a unique integer:
+````yaml
+identifiers:
+  - identifier: students
+    attributes:
+      - field: id
+        generator: unique_integer
+````
+We tell dtspec that this identifier is associated with the `id` columns of both
+the source and the target via:
+````yaml
+sources:
+  - source: raw_students
+    identifier_map:
+      - column: id
+        identifier:
+          name: students
+          attribute: id
+targets:
+  - target: salutations
+    identifier_map:
+      - column: id
+        identifier:
+          name: students
+          attribute: id
+````
+With identifiers attached to the sources and targets, the values we see in
+the source factories and target expectations are not the values that
+are actually used in the data. Instead, they are simply **named
+references**. For example, in the "HelloGang" case, `id=1` belongs to
+Buffy and `id=2` belongs to Willow. But when dtspec generates the source
+data, the actual values may be 3 and 9, or 4 and 7, or something else.
+Unique values are not generated in any deterministic manner -- each
+run of dtspec can give a different set. dtspec only guarantees that
+each named reference will be a unique integer (via the `generator`
+defined in the `identifier` section).
+Furthermore, in the second case, called "GoodbyeVillians", we see that
+`id=1` belongs to Drusilla and `id=2` belongs to Harmony. dtspec will
+generate unique values for this case as well, and they **will not**
+conflict with the values generated for the first case.
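The uniqueness guarantee described above can be pictured with a toy generator. This is an illustration of the idea only, not dtspec's actual implementation:

````python
import itertools

# Each (case, named_id) pair is translated to its own unique integer,
# so named ids reused across cases never collide in the generated data.
_counter = itertools.count(1)
_translations = {}

def translate(case, named_id):
    key = (case, named_id)
    if key not in _translations:
        _translations[key] = next(_counter)
    return _translations[key]

hello_ids = [translate("HelloGang", 1), translate("HelloGang", 2)]
goodbye_ids = [translate("GoodbyeVillians", 1), translate("GoodbyeVillians", 2)]
# All four generated ids are distinct, and repeated lookups are stable.
````

Because the translation is keyed on both the case and the named id, reusing `id=1` in two cases still yields two distinct generated values.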
So dtspec will pass
+back to the user 4 total records (Buffy, Willow, Drusilla, Harmony) with 4
+different ids.
+With the [full YAML spec](tests/hello_world_multiple_cases.yml) defined, we can
+run the assertions in the same fashion as in the earlier example:
+````python
+spec = yaml.safe_load(open("tests/hello_world_multiple_cases.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_multiple_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+#### Embedded Identifiers
+It is also possible to embed identifiers in the value of a particular column.
+For example, suppose our `salutation` column said hello to the `id` instead
+of the name of the person. To make this work, we have to put a particular
+string pattern in the column that indicates the name of the identifier, the
+attribute, and the named id - `{identifier.attribute[named_id]}`. The
+YAML spec would look like:
+````yaml
+  - case: HelloGang
+    description: Make sure we say hello to everyone
+    expected:
+      data:
+        - target: salutations
+          table: |
+            | id | name | clique | salutation |
+            | - | - | - | - |
+            | 1 | Buffy | Scooby Gang | Hello {students.id[1]} |
+            | 2 | Willow | Scooby Gang | Hello {students.id[2]} |
+````
+The [realistic example](tests/realistic.yml) discussed below has another example
+of using embedded identifiers.
+**Note** that embedded identifiers cannot be used to associate records
+with cases. A target must have at least one column listed in the
+`identifier_map` section.
+### A More Realistic Example
+Finally, let's examine a more realistic example that one might
+encounter when building a data warehouse. In these situations, we'll
+have multiple sources, targets, scenarios, and cases. Now suppose we
+have a students table, where every student belongs to a school and
+takes 0 to many classes.
Our goal is to create one denormalized table
+that combines all of these data sources. Additionally,
+we want to create a table that aggregates all of our students to give
+a count of the students per school. In Pandas, the data transformation
+might look like:
+````python
+def realistic_transformer(raw_students, raw_schools, raw_classes, dim_date):
+    student_schools = raw_students.rename(
+        columns={"id": "student_id", "external_id": "card_id"}
+    ).merge(
+        raw_schools.rename(columns={"id": "school_id", "name": "school_name"}),
+        how="inner",
+        on="school_id",
+    )
+    student_classes = student_schools.merge(
+        raw_classes.rename(columns={"name": "class_name"}),
+        how="inner",
+        on="student_id",
+    ).merge(
+        dim_date.rename(columns={"date": "start_date"}), how="left", on="start_date"
+    )
+    student_classes["student_class_id"] = student_classes.apply(
+        lambda row: "-".join([str(row["card_id"]), str(row["class_name"])]), axis=1
+    )
+    students_per_school = (
+        student_schools.groupby(["school_name"])
+        .size()
+        .to_frame(name="number_of_students")
+        .reset_index()
+    )
+    return {
+        "student_classes": student_classes,
+        "students_per_school": students_per_school,
+    }
+````
+Given the [full YAML spec](tests/realistic.yml) defined, we can again run
+the data assertions using a familiar pattern:
+````python
+spec = yaml.safe_load(open("tests/realistic.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = realistic_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+## dbt support
+dtspec also contains a CLI tool that can facilitate using it with [dbt](https://getdbt.com).
+The CLI tool helps you set up a test environment, run dbt in that environment, and
+execute the dtspec tests. The CLI tool currently only works for Postgres and Snowflake dbt
+projects.
+See the [dbt-container-skeleton](https://github.com/gnilrets/dbt-container-skeleton) for a
+working example.
+### dtspec CLI Config
+All of the dtspec files should be placed in a subdirectory of your dbt project: `dbt/dtspec`.
+The first thing to set up for the dtspec CLI is the configuration file, which should
+be placed in `dtspec/config.yml`. The configuration file tells dtspec how and where to
+recreate the table schemas in a test environment, and where to find the results of a dbt run.
+Here is an example:
+````yaml
+# A target environment is where the output of data transformations appear.
+# Typically, there will only be one target environment.
+target_environments:
+  # The target environment IS NOT your production environment. It needs to be a separate
+  # database where dbt will run against the test data that dtspec generates. The name
+  # of this environment needs to be the same as a target defined in dbt profiles.yml (in this case `dtspec`)
+  dtspec:
+    # Field names here follow the same conventions as dbt profiles.yml (https://docs.getdbt.com/dbt-cli/configure-your-profile)
+    type: postgres
+    host: "{{ env_var('POSTGRES_HOST') }}"
+    port: 5432
+    user: "{{ env_var('POSTGRES_USER') }}"
+    password: "{{ env_var('POSTGRES_PASSWORD') }}"
+    dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+# A source environment is where source data is located. It may be in the same database
+# as the target environment or it may be different if the data warehouse supports it (e.g., Snowflake).
+# It is also possible to define several source environments if your source data is spread
+# across multiple databases.
+source_environments:
+  raw:
+    # Use `tables` to specify source tables that need to be present to run tests.
+    tables:
+      # `wh_raw` is the name of a namespace (aka schema) in the `raw` source environment
+      wh_raw:
+        # tables may be listed individually (or, use `wh_raw: '*'` to indicate all tables within the `wh_raw` namespace)
+        - raw_customers
+        - raw_orders
+        - raw_payments
+    # In order to run tests, we need to replicate the table schemas in the test environment.
+    # The schema section here contains credentials for a database where those tables are defined.
+    # This is likely a production database (in your warehouse), or is a production replica.
+    # dtspec only uses this database to reflect the table schemas (via `dtspec db --fetch-schemas`).
+    schema:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}"
+    # The test section contains credentials for a database where test data will be created.
+    # Data in this database is destroyed and rebuilt for every run of dtspec and SHOULD NOT be
+    # the same as the schema credentials defined above.
+    test:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+  # Pretending snapshots are in a different database because Postgres doesn't support cross-db queries.
+  # This is how you would do it if snapshots were in a different database than other raw source data.
+  snapshots:
+    tables:
+      snapshots: '*'
+    schema:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}"
+    test:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+````
+### Test environment setup
+Once the configuration file has been defined, the next step is to fetch/reflect schemas for
+the source tables. From the `dbt` directory, run the following CLI command:
+    dtspec db --fetch-schemas
+This will query all of the databases defined in the `schema` section of the source
+environments defined in `dtspec/config.yml`, and create table schema files in `dtspec/schemas`.
+The files in this directory should be committed to source control and updated whenever
+your source data changes (insofar as it would affect the dtspec tests).
+Next, initialize the test databases defined in the `test` section of the source
+environments defined in `dtspec/config.yml` with the CLI command:
+    dtspec db --init-test-db
+This will create empty source tables in your test databases, ready to be loaded with test data.
+### Executing tests
+In order to use dtspec with dbt, spec files must make use of the `dbt_source` and `dbt_ref`
+Jinja functions. These are analogous to the dbt `source` and `ref` functions. dtspec
+will compile your dbt project and use the `dbt/target/manifest.json` file to resolve the names
+of sources and targets that you want to test.
For example, the SomeStudents factory
+would be written as follows if this were a dbt project:
+````yaml
+factories:
+  - factory: SomeStudents
+    data:
+      - source: {{ dbt_source('raw', 'raw_students') }}
+        table: |
+          | id | name |
+          | - | - |
+          | 1 | Buffy |
+          | 2 | Willow |
+````
+and an expectation would be:
+````yaml
+  cases:
+    - case: HelloGang
+      expected:
+        data:
+          - target: {{ dbt_ref('salutations') }}
+            table: |
+              | id | salutation |
+              | - | - |
+              | 1 | Hello Buffy |
+              | 2 | Hello Willow |
+````
+With these references set, dtspec tests can be executed via the CLI command:
+    dtspec test-dbt
+This command will do the following:
+1. It will first compile your dbt project. If your dbt code does not change between
+   dtspec tests, you may skip this step by passing the `--partial-parse` argument.
+2. The dtspec spec files are compiled into a single document and dbt references are resolved.
+   The compiled dtspec document is output to `dtspec/compiled_specs.yml`, which does not
+   need to be saved to source control.
+3. Source data is generated and loaded into the test databases.
+4. dbt is executed against the test database.
+5. The models that dbt built in the target test environment are extracted. These are the "actuals".
+6. The actuals are compared with the expected data as specified in the dtspec specs.
+The `test-dbt` command has several options that may be useful. See `dtspec test-dbt -h` for a full
+list, but here are some noteworthy options:
+- `--models` specifies the models that dbt should run, using standard dbt model selection syntax.
+- `--scenarios` is used to restrict the number of scenarios that are tested. The argument is a
+  regular expression that will match on the compiled Scenario name. This can be used
+  in combination with the `--models` option to only run those tests and models that you're
+  concerned with.
+### Additional CLI notes
+#### Log level
+If you want to see more detailed logging information, set the `DTSPEC_LOG_LEVEL` environment
+variable (options are DEBUG, INFO, WARN, and ERROR). For example:
+    DTSPEC_LOG_LEVEL=INFO dtspec test-dbt
+#### Project location
+If you really don't want to put dtspec in the dbt project directory, you can override the
+default by setting `DTSPEC_ROOT` and `DBT_ROOT` environment variables that point
+to the root path of these projects.
+#### Special Values
+When dtspec is run via the CLI, it recognizes nulls and booleans in the spec files. To
+indicate these kinds of values in a dtspec spec, use `{NULL}`, `{True}`, and `{False}`.
+For example:
+````yaml
+  cases:
+    - case: HelloGang
+      expected:
+        data:
+          - target: {{ dbt_ref('salutations') }}
+            table: |
+              | id | salutation | is_witch |
+              | - | - | - |
+              | 1 | Hello Buffy | {False} |
+              | 2 | Hello Willow | {True} |
+              | 3 | Hello NA | {NULL} |
+````
+#### Jinja context
+When writing spec files that will be parsed with the dtspec CLI, the following functions
+are available in the jinja context:
+* `datetime` -- This is the [Python datetime.datetime type](https://docs.python.org/3/library/datetime.html)
+* `date` -- This is the [Python datetime.date type](https://docs.python.org/3/library/datetime.html)
+* `relativedelta` -- This is the [Python relativedelta type](https://dateutil.readthedocs.io/en/stable/relativedelta.html)
+* `UTCNOW` -- The UTC datetime value at the time the specs are parsed
+* `TODAY` -- The current UTC date value at the time the specs are parsed
+* `YESTERDAY` -- Yesterday's date
+* `TOMORROW` -- Tomorrow's date
+* `dbt_source` -- Used to reference dbt sources
+* `dbt_ref` -- Used to reference dbt models
+Some examples of using these functions:
+    - source: raw_products
+      table: |
+        | export_time | file | product_id | product_name |
+        | - | - | - | - |
+        | {{ YESTERDAY }} | products-2021-01-06.csv | milk | Milk |
+        | {{ TODAY - relativedelta(days=5) }} | products-2021-01-02.csv | milk | Milk |
+## Additional notes about dtspec
+* At the moment, all source data values are generated as strings. It
+  is up to the user to enforce data types suitable to their data
+  transformation system. Note that the dtspec dbt CLI commands handle this
+  for Postgres and Snowflake warehouses.
+* Additionally, data expectations are stringified prior to running assertions.
+## Contributing
+We welcome contributors! Please submit any suggestions or pull requests on GitHub.
+### Developer setup
+Create an appropriate Python environment. I like [miniconda](https://conda.io/miniconda.html),
+but use whatever you like:
+    conda create --name dtspec python=3.8
+    conda activate dtspec
+Then install pip packages:
+    pip install pip-tools
+    pip install --ignore-installed -r requirements.txt
+Run tests via:
+    inv test
+and run the linter via:
+    inv lint
+
+%package help
+Summary: Development documents and examples for dtspec
+Provides: python3-dtspec-doc
+%description help
+version: '0.1'
+description: HelloWorld - Simplest example of running dtspec
+# The names of sources and targets are arbitrary, but it's up to the user to determine
+# how they get mapped to/from their data transformation system.
+sources:
+  - source: raw_students
+targets:
+  - target: salutations
+````
+These define our inputs and outputs. But we also need to define how to generate
+data for the input(s). For that, we define a **factory**:
+````yaml
+factories:
+  - factory: SomeStudents
+    description: Minimal example of what some student records may look like
+    data:
+      - source: raw_students
+        # Tables written as a markdown table
+        table: |
+          | id | name |
+          | - | - |
+          | 1 | Buffy |
+          | 2 | Willow |
+````
+Lastly, we need to describe how we expect the data to look after it has been transformed.
+To do this, we define **scenarios** and **cases**. Scenarios are collections of cases
+that share some common data factory or describe similar situations.
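The pipe-delimited markdown tables used by factories are simple enough to parse with a few lines of Python. This is an illustrative sketch of the idea only, not dtspec's actual parser:

````python
def parse_markdown_table(text):
    # Split a pipe-delimited markdown table into a list of dicts,
    # skipping the `| - | - |` header-separator row.
    lines = [line.strip() for line in text.strip().splitlines()]
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:] if set("".join(row)) != {"-"}]

table = """
| id | name   |
| -  | -      |
| 1  | Buffy  |
| 2  | Willow |
"""
parse_markdown_table(table)
# -> [{'id': '1', 'name': 'Buffy'}, {'id': '2', 'name': 'Willow'}]
````

Note that every cell comes back as a string, which matches dtspec's behavior of generating all source values as strings.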
For now, our
+transform spec will just contain a single scenario and a single case:
+````yaml
+scenarios:
+  - scenario: Hello World
+    description: The simplest scenario
+    # All cases in this scenario will use this factory (which may be modified on a case-by-case basis)
+    factory:
+      parents:
+        - SomeStudents
+    cases:
+      - case: HelloGang
+        description: Make sure we say hello to everyone
+        expected:
+          data:
+            - target: salutations
+              # The actual output may also contain the "name" field, but the expectation
+              # will ignore comparing any fields not listed in the expected table.
+              table: |
+                | id | salutation |
+                | - | - |
+                | 1 | Hello Buffy |
+                | 2 | Hello Willow |
+````
+That's it. See also the [full YAML spec](tests/hello_world.yml).
+Now that we've described the full transform spec, we need to use it. The first step is to
+parse the YAML file, send it to the dtspec api, and have dtspec generate source data:
+````python
+import dtspec
+import yaml
+spec = yaml.safe_load(open("tests/hello_world.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+````
+The specific steps taken at this point are going to be sensitive to the data transformation
+environment being used, but we'll stick with our Pandas transformations for the sake of this
+tutorial. Given this, we can define a simple function that converts the source data returned
+from dtspec into Pandas dataframes:
+````python
+import pandas as pd
+def parse_sources(sources):
+    "Converts test data returned from dtspec api into Pandas dataframes"
+    return {
+        source_name: pd.DataFrame.from_records(data.serialize())
+        for source_name, data in sources.items()
+    }
+````
+We can then run those test Pandas dataframes through our data transformation function.
+````python
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_transformer(**sources_data)
+````
+Next, we need to convert the output dataframes of the transformations, `actual_data`,
+back into a format that can be loaded into dtspec for comparison. For Pandas,
+this function is:
+````python
+import json
+def serialize_actuals(actuals):
+    "Converts Pandas dataframe results into form needed to load dtspec api actuals"
+    return {
+        target_name: json.loads(dataframe.astype(str).to_json(orient="records"))
+        for target_name, dataframe in actuals.items()
+    }
+````
+It is loaded into dtspec using:
+````python
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+````
+Finally, dtspec can be called to run all of the expectations:
+````python
+api.assert_expectations()
+````
+Putting all of this together:
+````python
+spec = yaml.safe_load(open("tests/hello_world.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+Try running the above code and changing either the YAML spec or the `hello_world_transformer`
+function and see how dtspec responds.
+### Hello World With Multiple Test Cases
+Running tests with multiple cases that reference the same data sources
+introduces a complicating factor. One of the things that makes
+it hard to build tests for ETL/ELT is the fact that many data
+transformation systems in use today have high latency for even very
+small transformations. For example, Redshift is a distributed RDBMS
+that can process billions of rows in minutes, millions of rows in
+seconds, thousands of rows in seconds, or 10s of rows in, well,
+seconds.
Given these latency issues, we don't want to have to rely on
+loading data into our system, running a test, clearing out the data,
+loading some more, running the next test, and so on as is often
+done when testing ORM-based applications like Rails or Django.
+dtspec seeks to minimize the number of requests on the data
+transformation system in order to deal with these latency issues.
+It does this by "stacking" the test data generated in each case
+and delivering back to the user all of this stacked data. The user
+then loads this stacked data into their data transformation system
+**once**, runs the data transformations **once**, and then collects
+the resulting output **once**.
+Let's see how dtspec handles this in action.
+First, let's change our hello world data transformation a bit. Instead of
+just saying hello to our heroes, let's say goodbye to any villains (as
+identified by a `clique` data field).
+````python
+def hello_world_multiple_transformer(raw_students):
+    def salutation(row):
+        if row["clique"] == "Scooby Gang":
+            return "Hello {}".format(row["name"])
+        return "Goodbye {}".format(row["name"])
+    salutations_df = raw_students.copy()
+    salutations_df["salutation"] = salutations_df.apply(salutation, axis=1)
+    return {"salutations": salutations_df}
+````
+While it would be possible to test saying hello or goodbye in a single
+case just by adding more records to the source data, we'll split it
+into two to demonstrate how multiple cases work.
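As a quick sanity check, the transformer above can be run directly against a hand-built dataframe (the rows below are illustrative and not part of any spec):

````python
import pandas as pd

# Same transformer as above, repeated here so this sketch is self-contained.
def hello_world_multiple_transformer(raw_students):
    def salutation(row):
        if row["clique"] == "Scooby Gang":
            return "Hello {}".format(row["name"])
        return "Goodbye {}".format(row["name"])
    salutations_df = raw_students.copy()
    salutations_df["salutation"] = salutations_df.apply(salutation, axis=1)
    return {"salutations": salutations_df}

raw_students = pd.DataFrame(
    [
        {"id": 1, "name": "Buffy", "clique": "Scooby Gang"},
        {"id": 2, "name": "Drusilla", "clique": "Vampires"},
    ]
)
result = hello_world_multiple_transformer(raw_students)["salutations"]
# result["salutation"] contains ["Hello Buffy", "Goodbye Drusilla"]
````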
Here's how the YAML would look:
+````yaml
+scenarios:
+  - scenario: Hello World With Multiple Cases
+    description: The simplest scenario
+    factory:
+      parents:
+        - SomeStudents
+    cases:
+      - case: HelloGang
+        description: Make sure we say hello to everyone
+        expected:
+          data:
+            - target: salutations
+              table: |
+                | id | name | clique | salutation |
+                | - | - | - | - |
+                | 1 | Buffy | Scooby Gang | Hello Buffy |
+                | 2 | Willow | Scooby Gang | Hello Willow |
+      - case: GoodbyeVillians
+        description: Say goodbye to villains
+        # For this case, we tweak the factory defined for the scenario.
+        factory:
+          # The ids here might be the same as above. However, these are just named
+          # references and get translated into unique ids when the source data
+          # is generated.
+          data:
+            - source: raw_students
+              table: |
+                | id | name |
+                | - | - |
+                | 1 | Drusilla |
+                | 2 | Harmony |
+              # Use values to populate a constant over all records
+              values:
+                - column: clique
+                  value: Vampires
+        expected:
+          data:
+            # Again, the ids here are not the actual ids sent to dtspec after performing
+            # the transformations. They are just named references and dtspec
+            # keeps track of the relationship between the actual ids and the named ones.
+            - target: salutations
+              table: |
+                | id | name | clique | salutation |
+                | - | - | - | - |
+                | 1 | Drusilla | Vampires | Goodbye Drusilla |
+                | 2 | Harmony | Vampires | Goodbye Harmony |
+````
+This won't quite work as is, because we're missing something. We have
+two cases that describe variations on the source data `raw_students`
+and the output `salutations`. dtspec collects the source data
+definitions from each case and stacks them into a single data source.
+The user then runs the transformations on that source and generates a
+single target to provide back to dtspec. But dtspec has to know which record
+belongs to which case.
To do this, we have to define an
+**identifier** that tells dtspec which columns should be used to identify
+a record as belonging to a case. A good identifier is often a primary
+key that uniquely defines a record, but it is not strictly required to
+be unique across all records.
+For this example, we'll define an identifier called "students" with a single
+**identifier attribute** called `id` that is a unique integer:
+````yaml
+identifiers:
+  - identifier: students
+    attributes:
+      - field: id
+        generator: unique_integer
+````
+We tell dtspec that this identifier is associated with the `id` columns of both
+the source and the target via:
+````yaml
+sources:
+  - source: raw_students
+    identifier_map:
+      - column: id
+        identifier:
+          name: students
+          attribute: id
+targets:
+  - target: salutations
+    identifier_map:
+      - column: id
+        identifier:
+          name: students
+          attribute: id
+````
+With identifiers attached to the sources and targets, the values we see in
+the source factories and target expectations are not the values that
+are actually used in the data. Instead, they are simply **named
+references**. For example, in the "HelloGang" case, `id=1` belongs to
+Buffy and `id=2` belongs to Willow. But when dtspec generates the source
+data, the actual values may be 3 and 9, or 4 and 7, or something else.
+Unique values are not generated in any deterministic manner -- each
+run of dtspec can give a different set. dtspec only guarantees that
+each named reference will be a unique integer (via the `generator`
+defined in the `identifier` section).
+Furthermore, in the second case, called "GoodbyeVillians", we see that
+`id=1` belongs to Drusilla and `id=2` belongs to Harmony. dtspec will
+generate unique values for this case as well, and they **will not**
+conflict with the values generated for the first case.
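The uniqueness guarantee described above can be pictured with a toy generator. This is an illustration of the idea only, not dtspec's actual implementation:

````python
import itertools

# Each (case, named_id) pair is translated to its own unique integer,
# so named ids reused across cases never collide in the generated data.
_counter = itertools.count(1)
_translations = {}

def translate(case, named_id):
    key = (case, named_id)
    if key not in _translations:
        _translations[key] = next(_counter)
    return _translations[key]

hello_ids = [translate("HelloGang", 1), translate("HelloGang", 2)]
goodbye_ids = [translate("GoodbyeVillians", 1), translate("GoodbyeVillians", 2)]
# All four generated ids are distinct, and repeated lookups are stable.
````

Because the translation is keyed on both the case and the named id, reusing `id=1` in two cases still yields two distinct generated values.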
So dtspec will pass
+back to the user 4 total records (Buffy, Willow, Drusilla, Harmony) with 4
+different ids.
+With the [full YAML spec](tests/hello_world_multiple_cases.yml) defined, we can
+run the assertions in the same fashion as in the earlier example:
+````python
+spec = yaml.safe_load(open("tests/hello_world_multiple_cases.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = hello_world_multiple_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+#### Embedded Identifiers
+It is also possible to embed identifiers in the value of a particular column.
+For example, suppose our `salutation` column said hello to the `id` instead
+of the name of the person. To make this work, we have to put a particular
+string pattern in the column that indicates the name of the identifier, the
+attribute, and the named id - `{identifier.attribute[named_id]}`. The
+YAML spec would look like:
+````yaml
+  - case: HelloGang
+    description: Make sure we say hello to everyone
+    expected:
+      data:
+        - target: salutations
+          table: |
+            | id | name | clique | salutation |
+            | - | - | - | - |
+            | 1 | Buffy | Scooby Gang | Hello {students.id[1]} |
+            | 2 | Willow | Scooby Gang | Hello {students.id[2]} |
+````
+The [realistic example](tests/realistic.yml) discussed below has another example
+of using embedded identifiers.
+**Note** that embedded identifiers cannot be used to associate records
+with cases. A target must have at least one column listed in the
+`identifier_map` section.
+### A More Realistic Example
+Finally, let's examine a more realistic example that one might
+encounter when building a data warehouse. In these situations, we'll
+have multiple sources, targets, scenarios, and cases. Now suppose we
+have a students table, where every student belongs to a school and
+takes 0 to many classes.
Our goal is to create one denormalized table
+that combines all of these data sources. Additionally,
+we want to create a table that aggregates all of our students to give
+a count of the students per school. In Pandas, the data transformation
+might look like:
+````python
+def realistic_transformer(raw_students, raw_schools, raw_classes, dim_date):
+    student_schools = raw_students.rename(
+        columns={"id": "student_id", "external_id": "card_id"}
+    ).merge(
+        raw_schools.rename(columns={"id": "school_id", "name": "school_name"}),
+        how="inner",
+        on="school_id",
+    )
+    student_classes = student_schools.merge(
+        raw_classes.rename(columns={"name": "class_name"}),
+        how="inner",
+        on="student_id",
+    ).merge(
+        dim_date.rename(columns={"date": "start_date"}), how="left", on="start_date"
+    )
+    student_classes["student_class_id"] = student_classes.apply(
+        lambda row: "-".join([str(row["card_id"]), str(row["class_name"])]), axis=1
+    )
+    students_per_school = (
+        student_schools.groupby(["school_name"])
+        .size()
+        .to_frame(name="number_of_students")
+        .reset_index()
+    )
+    return {
+        "student_classes": student_classes,
+        "students_per_school": students_per_school,
+    }
+````
+Given the [full YAML spec](tests/realistic.yml) defined, we can again run
+the data assertions using a familiar pattern:
+````python
+spec = yaml.safe_load(open("tests/realistic.yml"))
+api = dtspec.api.Api(spec)
+api.generate_sources()
+sources_data = parse_sources(api.spec["sources"])
+actual_data = realistic_transformer(**sources_data)
+serialized_actuals = serialize_actuals(actual_data)
+api.load_actuals(serialized_actuals)
+api.assert_expectations()
+````
+## dbt support
+dtspec also contains a CLI tool that can facilitate using it with [dbt](https://getdbt.com).
+The CLI tool helps you set up a test environment, run dbt in that environment, and
+execute the dtspec tests. The CLI tool currently only works for Postgres and Snowflake dbt
+projects.
+See the [dbt-container-skeleton](https://github.com/gnilrets/dbt-container-skeleton) for a
+working example.
+### dtspec CLI Config
+All of the dtspec files should be placed in a subdirectory of your dbt project: `dbt/dtspec`.
+The first thing to set up for the dtspec CLI is the configuration file, which should
+be placed in `dtspec/config.yml`. The configuration file tells dtspec how and where to
+recreate the table schemas in a test environment, and where to find the results of a dbt run.
+Here is an example:
+````yaml
+# A target environment is where the output of data transformations appear.
+# Typically, there will only be one target environment.
+target_environments:
+  # The target environment IS NOT your production environment. It needs to be a separate
+  # database where dbt will run against the test data that dtspec generates. The name
+  # of this environment needs to be the same as a target defined in dbt profiles.yml (in this case `dtspec`)
+  dtspec:
+    # Field names here follow the same conventions as dbt profiles.yml (https://docs.getdbt.com/dbt-cli/configure-your-profile)
+    type: postgres
+    host: "{{ env_var('POSTGRES_HOST') }}"
+    port: 5432
+    user: "{{ env_var('POSTGRES_USER') }}"
+    password: "{{ env_var('POSTGRES_PASSWORD') }}"
+    dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+# A source environment is where source data is located. It may be in the same database
+# as the target environment or it may be different if the data warehouse supports it (e.g., Snowflake).
+# It is also possible to define several source environments if your source data is spread
+# across multiple databases.
+source_environments:
+  raw:
+    # Use `tables` to specify source tables that need to be present to run tests.
+    tables:
+      # `wh_raw` is the name of a namespace (aka schema) in the `raw` source environment
+      wh_raw:
+        # tables may be listed individually (or, use `wh_raw: '*'` to indicate all tables within the `wh_raw` namespace)
+        - raw_customers
+        - raw_orders
+        - raw_payments
+    # In order to run tests, we need to replicate the table schemas in the test environment.
+    # The schema section here contains credentials for a database where those tables are defined.
+    # This is likely a production database (in your warehouse), or is a production replica.
+    # dtspec only uses this database to reflect the table schemas (via `dtspec db --fetch-schemas`).
+    schema:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}"
+    # The test section contains credentials for a database where test data will be created.
+    # Data in this database is destroyed and rebuilt for every run of dtspec and SHOULD NOT be
+    # the same as the schema credentials defined above.
+    test:
+      type: postgres
+      host: "{{ env_var('POSTGRES_HOST') }}"
+      port: 5432
+      user: "{{ env_var('POSTGRES_USER') }}"
+      password: "{{ env_var('POSTGRES_PASSWORD') }}"
+      dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
+  # Pretending snapshots are in a different database because Postgres doesn't support cross-db queries.
+  # This is how you would do it if snapshots were in a different database than other raw source data.
+  snapshots:
    tables:
      snapshots: '*'
    schema:
      type: postgres
      host: "{{ env_var('POSTGRES_HOST') }}"
      port: 5432
      user: "{{ env_var('POSTGRES_USER') }}"
      password: "{{ env_var('POSTGRES_PASSWORD') }}"
      dbname: "{{ env_var('POSTGRES_DBNAME') }}"
    test:
      type: postgres
      host: "{{ env_var('POSTGRES_HOST') }}"
      port: 5432
      user: "{{ env_var('POSTGRES_USER') }}"
      password: "{{ env_var('POSTGRES_PASSWORD') }}"
      dbname: "{{ env_var('POSTGRES_DBNAME') }}_dtspec"
````
### Test environment setup
Once the configuration file has been defined, the next step is to fetch/reflect schemas for
the source tables. From the `dbt` directory, run the following CLI command:

    dtspec db --fetch-schemas

This will query all of the databases defined in the `schema` section of the source
environments defined in `dtspec/config.yml`, and create table schema files in `dtspec/schemas`.
The files in this directory should be committed to source control and updated whenever
your source data changes (insofar as it would affect the dtspec tests).
Next, initialize the test databases defined in the `test` section of the source
environments defined in `dtspec/config.yml` with the CLI command

    dtspec db --init-test-db

This will create empty source tables in your test databases, ready to be loaded with test data.
### Executing tests
In order to use dtspec with dbt, spec files must make use of the `dbt_source` and `dbt_ref`
Jinja functions. These are analogous to the dbt `source` and `ref` functions. dtspec
will compile your dbt project and use the `dbt/target/manifest.json` file to resolve the names
of sources and targets that you want to test.
For example, the SomeStudents factory
would be written as follows if this were a dbt project:
````yaml
factories:
  - factory: SomeStudents
    data:
      - source: {{ dbt_source('raw', 'raw_students') }}
        table: |
          | id | name   |
          | -  | -      |
          | 1  | Buffy  |
          | 2  | Willow |
````
and an expectation would be:
````yaml
  cases:
    - case: HelloGang
      expected:
        data:
          - target: {{ dbt_ref('salutations') }}
            table: |
              | id | salutation   |
              | -  | -            |
              | 1  | Hello Buffy  |
              | 2  | Hello Willow |
````
With these references set, dtspec tests can be executed via the CLI command:

    dtspec test-dbt

This command will do the following:
1. It will first compile your dbt project. If your dbt code does not change between
   dtspec tests, you may skip this step by passing the `--partial-parse` argument.
2. The dtspec spec files are compiled into a single document and dbt references are resolved.
   The compiled dtspec document is output to `dtspec/compiled_specs.yml`, which does not
   need to be saved to source control.
3. Source data is generated and loaded into the test databases.
4. dbt is executed against the test database.
5. The models that dbt built in the target test environment are extracted. These are the "actuals".
6. The actuals are compared with the expected data as specified in the dtspec specs.
The `test-dbt` command has several options that may be useful. See `dtspec test-dbt -h` for a full
list, but here are some noteworthy options:
- `--models` specifies the models that dbt should run, using standard dbt model selection syntax.
- `--scenarios` is used to restrict the number of scenarios that are tested. The argument is a
  regular expression that will match on the compiled Scenario name. This can be used
  in combination with the `--models` option to only run those tests and models that you're
  concerned with.
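Step 6 depends on dtspec's convention (noted under "Additional notes" near the end of this README) that data expectations are stringified before assertions run. Purely as an illustration of that idea, here is a minimal sketch in Python; `rows_match` is a hypothetical helper, not dtspec's actual implementation:

```python
# Illustrative sketch only -- not dtspec's real comparison logic.
# Because values are stringified before asserting, an integer 1 extracted
# from the warehouse matches the string "1" written in a spec table.

def rows_match(expected, actual):
    """Compare two lists of row dicts after stringifying every value."""
    stringify = lambda rows: [{k: str(v) for k, v in row.items()} for row in rows]
    return stringify(expected) == stringify(actual)

spec_rows = [{"id": "1", "salutation": "Hello Buffy"}]
warehouse_rows = [{"id": 1, "salutation": "Hello Buffy"}]
print(rows_match(spec_rows, warehouse_rows))  # True
```

This is why spec tables can be written entirely as text even when the underlying warehouse columns are typed.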
+### Additional CLI notes
#### Log level
If you want to see more detailed logging information, set the `DTSPEC_LOG_LEVEL` environment
variable (options are DEBUG, INFO, WARN, and ERROR). For example:

    DTSPEC_LOG_LEVEL=INFO dtspec test-dbt

#### Project location
If you really don't want to put dtspec in the dbt project directory, you can override the
default by setting the `DTSPEC_ROOT` and `DBT_ROOT` environment variables to point
to the root paths of these projects.
#### Special Values
When dtspec is run via the CLI, it recognizes nulls and booleans in the spec files. To
indicate these kinds of values in a dtspec spec, use `{NULL}`, `{True}`, and `{False}`.
For example:
````yaml
  cases:
    - case: HelloGang
      expected:
        data:
          - target: {{ dbt_ref('salutations') }}
            table: |
              | id | salutation   | is_witch |
              | -  | -            | -        |
              | 1  | Hello Buffy  | {False}  |
              | 2  | Hello Willow | {True}   |
              | 3  | Hello NA     | {NULL}   |
````
#### Jinja context
When writing spec files that will be parsed with the dtspec CLI, the following functions
are available in the jinja context:
* `datetime` -- This is the [Python datetime.datetime type](https://docs.python.org/3/library/datetime.html)
* `date` -- This is the [Python datetime.date type](https://docs.python.org/3/library/datetime.html)
* `relativedelta` -- This is the [Python relativedelta type](https://dateutil.readthedocs.io/en/stable/relativedelta.html)
* `UTCNOW` -- The UTC datetime value at the time the specs are parsed
* `TODAY` -- The current UTC date value at the time the specs are parsed
* `YESTERDAY` -- Yesterday's date
* `TOMORROW` -- Tomorrow's date
* `dbt_source` -- Used to reference dbt sources
* `dbt_ref` -- Used to reference dbt models
Some examples of using these functions:

    - source: raw_products
      table: |
        | export_time                         | file                    | product_id | product_name |
        | -                                   | -                       | -          | -            |
        | {{ YESTERDAY }}                     | products-2021-01-06.csv | milk       | Milk         |
        | {{ TODAY - relativedelta(days=5) }} | 
products-2021-01-02.csv | milk | Milk |
## Additional notes about dtspec
* At the moment, all source data values are generated as strings. It
  is up to the user to enforce data types suitable to their data
  transformation system. Note that the dtspec dbt CLI commands handle this
  for Postgres and Snowflake warehouses.
* Additionally, data expectations are stringified prior to running assertions.
## Contributing
We welcome contributors! Please submit any suggestions or pull requests on GitHub.
### Developer setup
Create an appropriate python environment. I like [miniconda](https://conda.io/miniconda.html),
but use whatever you like:

    conda create --name dtspec python=3.8
    conda activate dtspec

Then install pip packages

    pip install pip-tools
    pip install --ignore-installed -r requirements.txt

run tests via

    inv test

and the linter via

    inv lint

%prep
%autosetup -n dtspec-0.7.5

%build
%py3_build

%install
%py3_install
install -d -m755 %{buildroot}/%{_pkgdocdir}
if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi
if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi
if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi
if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi
pushd %{buildroot}
if [ -d usr/lib ]; then
    find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/lib64 ]; then
    find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/bin ]; then
    find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst
fi
if [ -d usr/sbin ]; then
    find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst
fi
touch doclist.lst
if [ -d usr/share/man ]; then
    find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst
fi
popd
mv %{buildroot}/filelist.lst .
mv %{buildroot}/doclist.lst .
+ +%files -n python3-dtspec -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 31 2023 Python_Bot <Python_Bot@openeuler.org> - 0.7.5-1 +- Package Spec generated @@ -0,0 +1 @@ +2c5c3d465a8cd9c2e13cc3b921464b43 dtspec-0.7.5.tar.gz |