TiDB HTAP Quick Start Guide
This guide explains the easiest way to get started with TiDB’s one-stop solution for Hybrid Transactional and Analytical Processing, or HTAP.
Basic Concepts
Before using TiDB HTAP, you need some basic knowledge of TiKV, the row-based storage engine for TiDB OLTP, or Online Transactional Processing, and TiFlash, the columnar storage engine for TiDB OLAP, or Online Analytical Processing.
- HTAP storage engine: HTAP lets row-based storage engines and columnar storage engines coexist. Both storage engines can automatically replicate data and maintain strong consistency. The row-based storage engine optimizes OLTP performance, and the columnar storage engine optimizes OLAP performance.
- HTAP data consistency: as a distributed transactional key-value database, TiKV provides an ACID-compliant transaction interface and guarantees data consistency across replicas and high availability through the Raft consensus algorithm. As TiKV’s columnar storage extension, TiFlash replicates data from TiKV in real time according to the Raft Learner consensus algorithm.
- HTAP data isolation: to solve HTAP resource isolation issues, TiKV and TiFlash can be deployed on different systems as needed.
- MPP computing engine: MPP is a distributed computing framework provided by the TiFlash engine in TiDB 5.0 and later. It enables data exchange between nodes and provides high-performance, high-throughput SQL algorithms. Using MPP mode can greatly reduce analytical query execution time.
Procedure
In this document, you can experience the convenience and high performance of TiDB HTAP by querying sample tables from the TPC-H dataset. TPC-H is a common decision support benchmark made up of a large amount of data and a very complex set of business-oriented ad hoc queries. To experience all 22 SQL queries with TPC-H, access the tidb-bench repository or TPC-H to learn how to generate query statements and data.
Step 1. Deploy a local test environment
Before using TiDB HTAP, prepare a local test environment as described in the Quick Start Guide for the TiDB Database Platform and run the following command to deploy a TiDB cluster.
tiup playground
Step 2. Prepare test data
In the following procedure, you can create a TPC-H dataset as test data for using TiDB HTAP. If you are interested in TPC-H, see the general implementation guide.
-
Run the following command to install the test data generation tool.
tiup install bench -
Run the following command to generate test data.
tiup bench tpch --sf=1 prepareIf
Finishedappears in the output of this command, the data has been generated. -
Run the following SQL statement to display the generated data.
SELECT CONCAT(table_schema,'.',table_name) AS 'Table Name', table_rows AS 'Number of Rows', FORMAT_BYTES(data_length) AS 'Data Size', FORMAT_BYTES(index_length) AS 'Index Size', FORMAT_BYTES(data_length+index_length) AS'Total' FROM information_schema.TABLES WHERE table_schema='test';As shown in the output, a total of 8 tables are created and the largest table has 6.5 million rows. Because data is generated randomly, the number of rows created by the tool differs from the actual SQL query result.
+---------------+----------------+-----------+------------+-----------+ | Table Name | Number of Rows | Data Size | Index Size | Total | +---------------+----------------+-----------+------------+-----------+ | test.nation | 25 | 2.44 KiB | 0 bytes | 2.44 KiB | | test.region | 5 | 416 bytes | 0 bytes | 416 bytes | | test.part | 200000 | 25.07 MiB | 0 bytes | 25.07 MiB | | test.supplier | 10000 | 1.45 MiB | 0 bytes | 1.45 MiB | | test.partsupp | 800000 | 120.17 MiB| 12.21 MiB | 132.38 MiB| | test.customer | 150000 | 24.77 MiB | 0 bytes | 24.77 MiB | | test.orders | 1527648 | 174.40 MiB| 0 bytes | 174.40 MiB| | test.lineitem | 6491711 | 849.07 MiB| 99.06 MiB | 948.13 MiB| +---------------+----------------+-----------+------------+-----------+ 8 rows in set (0.06 sec)This is the database of a commercial order system. The
test.nationtable represents information about nations,test.regionregions,test.partparts,test.suppliersuppliers,test.partsuppsupplier parts,test.customercustomers,test.ordersorders, andtest.lineitemline items.
Step 3. Query data using the row-based storage engine
To check the performance of TiDB using only the row-based storage engine, run the following SQL statement.
SELECT
l_orderkey,
SUM(
l_extendedprice * (1 - l_discount)
) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1996-01-01'
AND l_shipdate > DATE '1996-02-01'
GROUP BY
l_orderkey,
o_orderdate,
o_shippriority
ORDER BY
revenue DESC,
o_orderdate
limit 10;
This is a shipping priority query that provides the priority and potential revenue of the highest-revenue orders that have not been shipped by a specified date. Potential revenue is defined as the sum of l_extendedprice * (1-l_discount). Orders are listed in descending order of revenue. In this example, the query lists the top 10 unshipped orders.
Step 4. Replicate test data to the column-oriented storage engine
After TiFlash is deployed, TiKV does not immediately replicate data to TiFlash. To specify the tables that need to be replicated, run the following DDL statements in the TiDB MySQL client. TiDB then creates the specified replicas in TiFlash accordingly.
ALTER TABLE test.customer SET TIFLASH REPLICA 1;
ALTER TABLE test.orders SET TIFLASH REPLICA 1;
ALTER TABLE test.lineitem SET TIFLASH REPLICA 1;
To check the replication status of a specific table, run the following statements.
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'customer';
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'orders';
SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = 'test' and TABLE_NAME = 'lineitem';
Results of the statements above:
AVAILABLEindicates whether the TiFlash replica of a specific table is available.1means available, and0means unavailable. Once a replica becomes available, this status no longer changes. If the number of replicas is changed with a DDL statement, the replication status is recalculated.PROGRESSmeans replication progress. The value is from 0.0 to 1.0. 1 means that at least one replica has been replicated.
Step 5. Analyze data faster with HTAP
You can rerun the SQL statement from step 3 to check TiDB HTAP performance.
For tables with TiFlash replicas, the TiDB optimizer automatically decides whether to use a TiFlash replica based on cost estimation. You can use desc or explain analyze statements to check whether a TiFlash replica is selected. For example:
explain analyze SELECT
l_orderkey,
SUM(
l_extendedprice * (1 - l_discount)
) AS revenue,
o_orderdate,
o_shippriority
FROM
customer,
orders,
lineitem
WHERE
c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1996-01-01'
AND l_shipdate > DATE '1996-02-01'
GROUP BY
l_orderkey,
o_orderdate,
o_shippriority
ORDER BY
revenue DESC,
o_orderdate
limit 10;
If the result of the EXPLAIN statement shows ExchangeSender or ExchangeReceiver operators, MPP mode is enabled.
You can also specify that each part of the full query be computed only by using the TiFlash engine. For details, see reading TiFlash replicas with TiDB.
You can compare the query results and query performance of these two methods.
What’s Next
- TiDB HTAP architecture
- Explore HTAP
- Use TiFlash
Quick start with HTAP last modified 2022-07-08 11:48:44: tiflash refactor: split use-tiflash into multiple docs (#9452) (#9521)