{
    "componentChunkName": "component---src-templates-blog-blog-detail-tsx",
    "path": "/blog/when-tidb-meets-dbt",
    "result": {"pageContext":{"blog":{"id":"Blogs_373","title":"当 TiDB 遇见 dbt丨让数据价值清晰可见","tags":["TiDB"],"category":{"name":"产品技术解读"},"summary":"TiDB 社区在近日推出了 dbt-tidb 插件，实现了 TiDB 和 dbt 的兼容适配。本文将通过一个简单的案例介绍如何通过 dbt 实现 TiDB 中数据的简单分析。","body":"dbt （data build tool）是一款流行的开源数据转换工具，能够通过 SQL 实现数据转化，将命令转化为表或者视图，提升数据分析师的工作效率。TiDB 社区在近日推出了 [dbt-tidb](https://github.com/pingcap/dbt-tidb) 插件，实现了 TiDB 和 [dbt](https://www.getdbt.com/) 的兼容适配。本文将通过一个简单的案例介绍如何通过 dbt 实现 TiDB 中数据的简单分析。\n\ndbt 主要功能在于转换数据库或数据仓库中的数据，在 E（Extract）、L（Load）、T（Transform） 的流程中，仅负责转换（transform）的过程。 通过 dbt-tidb 插件，数据分析师在使用 TiDB 的过程中，能够通过 SQL 直接建立表单并匹配数据，而无需关注创建 table 或 view 的过程，并且可以直观地看到数据的流动；同时能够运用 dbt 的 Jinja 编写 SQL、测试、包管理等功能，大大提升工作效率。\n\n![1.png](https://img1.www.pingcap.com/prod/1_b4af503f6a.png)\n（图片来源：https://blog.getdbt.com/what-exactly-is-dbt/）\n\n接下来，我将以 [dbt 官方教程](https://docs.getdbt.com/tutorial/setting-up)为例，给大家介绍下 TiDB 与 dbt 的结合使用。\n\n本例用到的相关软件及其版本要求：\n\n- TiDB 5.3 或更高版本\n- dbt 1.0.1 或更高版本\n- dbt-tidb 1.0.0\n\n## 安装\n\ndbt 除了本地 CLI 工具外，还支持 [dbt Cloud](https://docs.getdbt.com/docs/dbt-cloud/cloud-overview) (目前，dbt Cloud 只支持 dbt-lab 官方维护的 adapter)，其中本地 CLI 工具有多种安装方式。我们这里直接使用 pypi 安装 dbt 和 dbt-tidb 插件。\n\n安装 dbt 和 dbt-tidb，只需要一条命令，因为 dbt 会作为依赖在安装 dbt-tidb 的时候顺便安装。 \n\n```Bash\n$ pip install dbt-tidb\n```\n\ndbt 也可自行安装，安装方式参考[官方安装教程](https://docs.getdbt.com/dbt-cli/install/overview)。\n\n## 创建项目：jaffle_shop\n\njaffle_shop 是 dbt-lab 提供的用于演示 dbt 功能的工程项目，你可以直接从 GitHub 上获取它。\n\n```Bash\n$ git clone https://github.com/dbt-labs/jaffle_shop\n\n$ cd jaffle_shop\n```\n\n这里展开 jaffle_shop 工程目录下所有文件。\n\n- `dbt_project.yml` 是 dbt 项目的配置文件，其中保存着项目名称、数据库配置文件的路径信息等。\n- `models` 目录下存放该项目的 SQL 模型和 table 约束，注意这部分是数据分析师自行编写的。\n- `seed` 目录存放 CSV 文件。此类文件可以来源于数据库导出工具，例如TiDB 可以通过 [Dumpling](https://docs.pingcap.com/tidb/v4.0/dumpling-overview) 把 table 中的数据导出为 CSV 文件。jaffle_shop 工程中，这些 CSV 文件用来作为待处理的原始数据。\n\n关于它们更加具体的内容，在用到上面的某个文件或目录后，我会再次进行更详细的说明。\n\n```Bash\nubuntu@ubuntu:~/jaffle_shop$ tree\n.\n\n├── dbt_project.yml\n\n├── 
etc\n│   ├── dbdiagram_definition.txt\n│   └── jaffle_shop_erd.png\n├── LICENSE\n├── models\n│   ├── customers.sql\n│   ├── docs.md\n│   ├── orders.sql\n│   ├── overview.md\n│   ├── schema.yml\n│   └── staging\n│       ├── schema.yml\n│       ├── stg_customers.sql\n│       ├── stg_orders.sql\n│       └── stg_payments.sql\n├── README.md\n└── seeds\n    ├── raw_customers.csv\n    ├── raw_orders.csv\n    └── raw_payments.csv\n```\n\n## Configure the project\n\n1. Global configuration\n\ndbt has a default global configuration file, `~/.dbt/profiles.yml`. First create this file under your home directory and add the connection information for the TiDB database.\n\n```Bash\n$ vi ~/.dbt/profiles.yml\n\njaffle_shop_tidb:                        # profile name\n  target: dev                             # target\n  outputs:\n    dev:\n      type: tidb                         # adapter type\n      server: 127.0.0.1                  # host address\n      port: 4000                         # port\n      schema: analytics                  # database name\n      username: root                     # user name\n      password: \"\"                       # password\n```\n\n2. Project configuration\n\nThe jaffle_shop project directory contains the project's configuration file, `dbt_project.yml`. Change the `profile` setting to `jaffle_shop_tidb`, the profile name defined in `profiles.yml`. The project will then look up its database connection settings in `~/.dbt/profiles.yml`.\n\n```Bash\n$ cat dbt_project.yml\nname: 'jaffle_shop'\n\nconfig-version: 2\nversion: '0.1'\n\nprofile: 'jaffle_shop_tidb'                   # note the change here\n\nmodel-paths: [\"models\"]                       # model path\nseed-paths: [\"seeds\"]                         # seed path\ntest-paths: [\"tests\"]\nanalysis-paths: [\"analysis\"]\nmacro-paths: [\"macros\"]\n\ntarget-path: \"target\"\nclean-targets:\n    - \"target\"\n    - \"dbt_modules\"\n    - \"logs\"\n\nrequire-dbt-version: [\">=1.0.0\", \"<2.0.0\"]\n\nmodels:\n  jaffle_shop:\n      materialized: table            # models/*.sql are materialized as tables\n      staging:\n        materialized: view           # models/staging/*.sql 
are materialized as views\n```\n\n3. Verify the configuration\n\nYou can run the following command to check whether the database and project configuration are correct.\n\n```Bash\n$ dbt debug\n06:59:18  Running with dbt=1.0.1\ndbt version: 1.0.1\npython version: 3.8.10\npython path: /usr/bin/python3\nos info: Linux-5.4.0-97-generic-x86_64-with-glibc2.29\nUsing profiles.yml file at /home/ubuntu/.dbt/profiles.yml\nUsing dbt_project.yml file at /home/ubuntu/jaffle_shop/dbt_project.yml\n\nConfiguration:\n  profiles.yml file [OK found and valid]\n  dbt_project.yml file [OK found and valid]\n\nRequired dependencies:\n - git [OK found]\n\nConnection:\n  server: 127.0.0.1\n  port: 4000\n  database: None\n  schema: analytics\n  user: root\n  Connection test: [OK connection ok]\n\nAll checks passed!\n```\n\n## Load the CSV files\n\nLoad the CSV data and materialize the CSV files as tables in the target database. Note: a dbt project usually does not need this step, because the data to be processed is already in the database.\n\n```Bash\n$ dbt seed\n07:03:24  Running with dbt=1.0.1\n07:03:24  Partial parse save file not found. Starting full parse.\n07:03:25  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics\n07:03:25\n07:03:25  Concurrency: 1 threads (target='dev')\n07:03:25\n07:03:25  1 of 3 START seed file analytics.raw_customers.................................. [RUN]\n07:03:25  1 of 3 OK loaded seed file analytics.raw_customers.............................. [INSERT 100 in 0.19s]\n07:03:25  2 of 3 START seed file analytics.raw_orders..................................... [RUN]\n07:03:25  2 of 3 OK loaded seed file analytics.raw_orders................................. [INSERT 99 in 0.14s]\n07:03:25  3 of 3 START seed file analytics.raw_payments................................... [RUN]\n07:03:26  3 of 3 OK loaded seed file analytics.raw_payments............................... 
[INSERT 113 in 0.24s]\n07:03:26\n07:03:26  Finished running 3 seeds in 0.71s.\n07:03:26\n07:03:26  Completed successfully\n07:03:26\n07:03:26  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3\n```\n\nThe output clearly shows that three tasks were executed, loading the three tables `analytics.raw_customers`, `analytics.raw_orders`, and `analytics.raw_payments`.\n\nNext, check what happened in the TiDB database.\n\nA new `analytics` database has appeared. This is the project database that dbt created for us.\n\n```SQL\nmysql> show databases;\n+--------------------+\n| Database           |\n+--------------------+\n| INFORMATION_SCHEMA |\n| METRICS_SCHEMA     |\n| PERFORMANCE_SCHEMA |\n| analytics          |\n| mysql              |\n| test               |\n+--------------------+\n6 rows in set (0.00 sec)\n```\n\nThe `analytics` database contains three tables, one for each of the three tasks above.\n\n```SQL\nmysql> show tables;\n+---------------------+\n| Tables_in_analytics |\n+---------------------+\n| raw_customers       |\n| raw_orders          |\n| raw_payments        |\n+---------------------+\n3 rows in set (0.00 sec)\n```\n\n## What is a model?\n\nBefore moving on to the next step, it is worth understanding the role that models play in dbt.\n\ndbt uses models to describe the structure of a set of tables or views. Models consist mainly of two kinds of files: SQL and YML. Also note that in the jaffle_shop project, according to the [materialization configuration](https://github.com/dbt-labs/jaffle_shop/blob/main/dbt_project.yml), the `models/` directory holds table definitions and the `models/staging/` directory holds view definitions.\n\nTake `models/orders.sql` as an example. It is a single SQL query that supports [Jinja](https://jinja.palletsprojects.com/en/3.1.x/) syntax. The commands in the following sections create the `orders` table from this query.\n\n```SQL\n$ cat models/orders.sql\n{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}\n\nwith orders as (\n\n    select * from {{ ref('stg_orders') }}\n\n),\n\npayments as (\n\n    select * from {{ ref('stg_payments') }}\n\n),\n\norder_payments as (\n\n    select\n        order_id,\n\n        {% for payment_method in payment_methods -%}\n        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount,\n        {% endfor -%}\n\n        
sum(amount) as total_amount\n\n    from payments\n\n    group by order_id\n\n),\n\nfinal as (\n\n    select\n        orders.order_id,\n        orders.customer_id,\n        orders.order_date,\n        orders.status,\n\n        {% for payment_method in payment_methods -%}\n\n        order_payments.{{ payment_method }}_amount,\n\n        {% endfor -%}\n\n        order_payments.total_amount as amount\n\n    from orders\n\n\n    left join order_payments\n        on orders.order_id = order_payments.order_id\n\n)\n\nselect * from final\n```\n\nThe constraints that accompany this query are defined in `models/schema.yml`.\n\n`schema.yml` is the registry of all models in the current directory. The models are organized into a tree structure that describes the purpose and properties of each column. The `tests` entries define constraints on a column, which you can check with the `dbt test` command. For more information, see the [official documentation](https://docs.getdbt.com/docs/building-a-dbt-project/tests).\n\n```YAML\n$ cat models/schema.yml\nversion: 2\n...\n  - name: orders\n    description: This table has basic information about orders, as well as some derived facts based on payments\n\n    columns:\n      - name: order_id\n        tests:\n          - unique\n          - not_null\n        description: This is a unique identifier for an order\n\n      - name: customer_id\n        description: Foreign key to the customers table\n        tests:\n          - not_null\n          - relationships:\n              to: ref('customers')\n              field: customer_id\n\n      - name: order_date\n        description: Date (UTC) that the order was placed\n\n      - name: status\n        description: '{{ doc(\"orders_status\") }}'\n        tests:\n          - accepted_values:\n              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']\n\n      - name: amount\n        description: Total amount (AUD) of the order\n        tests:\n          - not_null\n\n      - name: credit_card_amount\n        description: Amount of the order (AUD) paid for by credit card\n  
      tests:\n          - not_null\n\n      - name: coupon_amount\n        description: Amount of the order (AUD) paid for by coupon\n        tests:\n          - not_null\n\n      - name: bank_transfer_amount\n        description: Amount of the order (AUD) paid for by bank transfer\n        tests:\n          - not_null\n\n      - name: gift_card_amount\n        description: Amount of the order (AUD) paid for by gift card\n        tests:\n          - not_null\n```\n\n## Run\n\nRun `dbt run`. The output shows that three views (`analytics.stg_customers`, `analytics.stg_orders`, `analytics.stg_payments`) and two tables (`analytics.customers`, `analytics.orders`) were created successfully.\n\n```Bash\n$ dbt run\n07:28:43  Running with dbt=1.0.1\n07:28:43  Unable to do partial parsing because profile has changed\n07:28:43  Unable to do partial parsing because a project dependency has been added\n07:28:44  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics\n07:28:44\n07:28:44  Concurrency: 1 threads (target='dev')\n07:28:44\n07:28:44  1 of 5 START view model analytics.stg_customers................................. [RUN]\n07:28:44  1 of 5 OK created view model analytics.stg_customers............................ [SUCCESS 0 in 0.12s]\n07:28:44  2 of 5 START view model analytics.stg_orders.................................... [RUN]\n07:28:44  2 of 5 OK created view model analytics.stg_orders............................... [SUCCESS 0 in 0.08s]\n07:28:44  3 of 5 START view model analytics.stg_payments.................................. [RUN]\n07:28:44  3 of 5 OK created view model analytics.stg_payments............................. [SUCCESS 0 in 0.07s]\n07:28:44  4 of 5 START table model analytics.customers.................................... [RUN]\n07:28:44  4 of 5 OK created table model analytics.customers............................... 
[SUCCESS 0 in 0.16s]\n07:28:44  5 of 5 START table model analytics.orders....................................... [RUN]\n07:28:45  5 of 5 OK created table model analytics.orders.................................. [SUCCESS 0 in 0.12s]\n07:28:45\n07:28:45  Finished running 3 view models, 2 table models in 0.64s.\n07:28:45\n07:28:45  Completed successfully\n07:28:45\n07:28:45  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5\n```\n\nVerify in the TiDB database that they were really created.\n\nThe result shows five new tables or views, such as `customers`, and the data in them has been transformed. Only part of the `customers` data is shown here.\n\n```SQL\nmysql> show tables;\n+---------------------+\n| Tables_in_analytics |\n+---------------------+\n| customers           |\n| orders              |\n| raw_customers       |\n| raw_orders          |\n| raw_payments        |\n| stg_customers       |\n| stg_orders          |\n| stg_payments        |\n+---------------------+\n8 rows in set (0.00 sec)\n\nmysql> select * from customers;\n+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+\n| customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |\n+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+\n|           1 | Michael    | P.        | 2018-01-01  | 2018-02-10        |                2 |                 33.0000 |\n|           2 | Shawn      | M.        | 2018-01-11  | 2018-01-11        |                1 |                 23.0000 |\n|           3 | Kathleen   | P.        | 2018-01-02  | 2018-03-11        |                3 |                 65.0000 |\n|           4 | Jimmy      | C.        | NULL        | NULL              |             NULL |                    NULL |\n|           5 | Katherine  | R.        | NULL        | NULL              |             NULL |                    NULL |\n|           6 | Sarah      | R.        
| 2018-02-19  | 2018-02-19        |                1 |                  8.0000 |\n|           7 | Martin     | M.        | 2018-01-14  | 2018-01-14        |                1 |                 26.0000 |\n|           8 | Frank      | R.        | 2018-01-29  | 2018-03-12        |                2 |                 45.0000 |\n....\n```\n\n## Generate documentation\n\ndbt can also generate visual documentation. The commands are as follows.\n\n1. Generate the documentation\n\n```Bash\n$ dbt docs generate\n07:33:59  Running with dbt=1.0.1\n07:33:59  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics\n07:33:59\n07:33:59  Concurrency: 1 threads (target='dev')\n07:33:59\n07:33:59  Done.\n07:33:59  Building catalog\n07:33:59  Catalog written to /home/ubuntu/jaffle_shop/target/catalog.json\n```\n\n2. Start the server\n\n```Bash\n$ dbt docs serve\n07:43:01  Running with dbt=1.0.1\n07:43:01  Serving docs at 0.0.0.0:8080\n07:43:01  To access from your browser, navigate to:  http://localhost:8080\n07:43:01\n07:43:01\n07:43:01  Press Ctrl+C to exit.\n```\n\nYou can view the documentation in a browser. It contains the overall structure of the jaffle_shop project and descriptions of all tables and views.\n\n![2.png](https://img1.www.pingcap.com/prod/2_5abdeb1f41.png)\n\n## Summary\n\nUsing TiDB with dbt involves the following main steps:\n\n1. Install dbt and dbt-tidb\n2. Configure the project\n3. Write SQL and YML files\n4. Run the project\n\nCurrently, dbt is supported on TiDB 4.0 and later. However, according to the dbt-tidb [project documentation](https://github.com/pingcap/dbt-tidb), earlier TiDB versions still have some issues when used with dbt, such as no support for temporary tables and temporary views and no support for the WITH syntax. For the smoothest experience with dbt, TiDB 5.3 or later is recommended, as these versions support all dbt features.\n","date":"2022-04-12","author":"PingCAP","fillInMethod":"writeDirectly","customUrl":"when-tidb-meets-dbt","file":null,"relatedBlogs":[]}}},
    "staticQueryHashes": ["1327623483","1820662718","3081853212","3430003955","3649515864","4265596160","63159454"]}