Exploring TiDB HTAP
This guide explains how to explore and use the features of TiDB Hybrid Transactional and Analytical Processing (HTAP).
Use Cases
TiDB HTAP can process rapidly growing large-scale data, reduce DevOps costs, and be deployed easily in on-premises or cloud environments, so it can deliver the value of data assets in real time.
The following are common HTAP use cases.
- Hybrid workloads: when you use TiDB for real-time Online Analytical Processing (OLAP) in a hybrid workload scenario, you only need to provide a single TiDB entry point for the data. TiDB automatically selects different processing engines based on the specific business workload.
- Real-time stream processing: when you use TiDB in a real-time stream processing scenario, TiDB makes it possible to query all continuously incoming data in real time. At the same time, TiDB can handle highly concurrent data workloads and business intelligence (BI) queries.
- Data hub: when you use TiDB as a data hub, TiDB can seamlessly connect application data and data warehouses to satisfy specific business requirements.
For more information about TiDB HTAP use cases, see the HTAP-related blog posts on the PingCAP website.
Architecture
In TiDB, the row-based storage engine TiKV, which serves Online Transactional Processing (OLTP), and the columnar storage engine TiFlash, which serves Online Analytical Processing (OLAP), coexist. The two engines replicate data automatically and maintain strong consistency.
For more information about the architecture, see TiDB HTAP Architecture.
Prepare the Environment
Before exploring TiDB HTAP features, deploy TiDB and the corresponding storage engines according to your data volume. For large data volumes, such as 100 TB, it is recommended to use TiFlash Massively Parallel Processing (MPP) as the primary solution and TiSpark as a secondary solution.
- TiFlash
  - If you deployed a TiDB cluster without TiFlash nodes, add TiFlash nodes to the current TiDB cluster. For details, see Scale out a TiFlash cluster.
  - If you have not deployed a TiDB cluster, see Deploy a TiDB Cluster Using TiUP. In addition to the minimal TiDB topology, you also need to deploy the TiFlash topology.
  - When deciding how many TiFlash nodes to deploy, consider the following scenarios.
    - If the use case requires OLTP with small-scale analysis and ad hoc queries, deploy at least one TiFlash node. Even a single node can greatly improve the speed of analytical queries.
    - If OLTP throughput does not put heavy pressure on the I/O utilization of TiFlash nodes, each TiFlash node can use more resources for computation, so the TiFlash cluster can scale nearly linearly. Adjust the number of TiFlash nodes according to the expected performance and response time.
    - If OLTP throughput is relatively high, for example if write or update throughput exceeds 10 million rows per hour, the limited write capacity of the network and physical disks can turn I/O between TiKV and TiFlash into a bottleneck and tends to create read and write hotspots. In this case, the number of TiFlash nodes has a complex nonlinear relationship with the amount of analytical computation, so it must be tuned based on the actual state of the system.
- TiSpark
  - If you need to analyze data with Spark, deploy TiSpark. For the detailed process, see the TiSpark user guide.
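The three TiFlash sizing scenarios above can be read as a small decision rule. The following Python sketch is illustrative only: the function name, the enum, and the two input parameters are hypothetical. Using the 10 million rows per hour figure as a hard cutoff is also an assumption, since the text gives it only as an example of high throughput.

```python
from enum import Enum

# Example figure from the text: write or update throughput above
# 10 million rows per hour counts as "relatively high".
# Treating it as a hard cutoff is an assumption of this sketch.
HIGH_WRITE_ROWS_PER_HOUR = 10_000_000


class TiFlashSizing(Enum):
    AT_LEAST_ONE_NODE = "deploy at least one TiFlash node"
    SCALE_NEAR_LINEARLY = "add nodes to match expected performance and response time"
    TUNE_EMPIRICALLY = "tune the node count based on the actual state of the system"


def sizing_guidance(write_rows_per_hour: int, small_scale_analysis_only: bool) -> TiFlashSizing:
    """Map a rough workload description to one of the three scenarios.

    Hypothetical helper: it does not replace measuring I/O utilization
    on a real cluster.
    """
    if write_rows_per_hour > HIGH_WRITE_ROWS_PER_HOUR:
        # High OLTP throughput: I/O between TiKV and TiFlash may become a
        # bottleneck, so the node count has to be tuned by observation.
        return TiFlashSizing.TUNE_EMPIRICALLY
    if small_scale_analysis_only:
        # OLTP with small-scale analysis and ad hoc queries.
        return TiFlashSizing.AT_LEAST_ONE_NODE
    # Moderate OLTP pressure: TiFlash can scale nearly linearly.
    return TiFlashSizing.SCALE_NEAR_LINEARLY
```

For example, `sizing_guidance(20_000_000, False)` falls into the third scenario, where no formula applies and the cluster must be tuned against real measurements.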
Prepare Data
After TiFlash is deployed, TiKV does not replicate data to TiFlash by default. You must manually specify the tables that need to be replicated to TiFlash. TiDB then creates the corresponding TiFlash replicas.
- If the TiDB cluster has no data, migrate data to TiDB first. For details, see Data Migration.
- If the TiDB cluster already has data replicated from upstream, data replication does not start automatically after TiFlash is deployed. You must manually specify the tables to replicate to TiFlash. For details, see Use TiFlash.
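As a minimal sketch of specifying a table for replication, the following Python helpers build the SQL statements involved: TiDB's `ALTER TABLE ... SET TIFLASH REPLICA` statement, and a progress query against the `information_schema.tiflash_replica` table. The helper names and the identifier check are hypothetical, and the sketch only builds strings. Executing them requires a MySQL-compatible client connected to TiDB, which is not shown.

```python
import re

# Conservative identifier check so that the generated SQL stays safe.
# Real TiDB identifiers allow more characters; this restriction is an
# assumption of the sketch.
_IDENT = re.compile(r"^[A-Za-z0-9_]+$")


def _check(identifier: str) -> str:
    if not _IDENT.match(identifier):
        raise ValueError(f"unsupported identifier: {identifier!r}")
    return identifier


def set_tiflash_replica_sql(db: str, table: str, replicas: int = 1) -> str:
    """Build the statement that asks TiDB to create TiFlash replicas."""
    if replicas < 0:
        raise ValueError("replica count must not be negative")
    return (
        f"ALTER TABLE `{_check(db)}`.`{_check(table)}` "
        f"SET TIFLASH REPLICA {replicas};"
    )


def replica_status_sql(db: str, table: str) -> str:
    """Build a query that reports replication availability and progress."""
    return (
        "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS "
        "FROM information_schema.tiflash_replica "
        f"WHERE TABLE_SCHEMA = '{_check(db)}' AND TABLE_NAME = '{_check(table)}';"
    )


if __name__ == "__main__":
    # "test" and "orders" are placeholder names.
    print(set_tiflash_replica_sql("test", "orders", 2))
    print(replica_status_sql("test", "orders"))
```

In the status query, the `AVAILABLE` column indicates whether the TiFlash replica can serve reads, and `PROGRESS` reports how far replication has advanced.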
Process Data
With TiDB, you can simply enter SQL statements for query or write requests. For tables that contain TiFlash replicas, TiDB uses the frontend optimizer to automatically select the optimal execution plan.
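To see which engine the optimizer chose for a query, you can run `EXPLAIN` on it and inspect the task column of the plan, where each operator is tagged with a value such as `root`, `cop[tikv]`, `cop[tiflash]`, or `mpp[tiflash]`. The Python sketch below assumes the task column has already been fetched as a list of strings. The function names are hypothetical, and the exact task labels can vary between TiDB versions.

```python
import re
from typing import Iterable, Set

# Matches the engine name inside brackets, for example "tiflash" in
# "mpp[tiflash]". Tasks without brackets (such as "root") run in TiDB itself.
_TASK_ENGINE = re.compile(r"\[(\w+)\]")


def engines_in_plan(task_column: Iterable[str]) -> Set[str]:
    """Collect the storage engines referenced by an EXPLAIN task column."""
    engines = set()
    for task in task_column:
        match = _TASK_ENGINE.search(task)
        if match:
            engines.add(match.group(1).lower())
    return engines


def uses_tiflash(task_column: Iterable[str]) -> bool:
    """Return True if at least one operator in the plan runs on TiFlash."""
    return "tiflash" in engines_in_plan(task_column)


if __name__ == "__main__":
    # Hand-written sample values, not output captured from a real cluster.
    sample_tasks = ["root", "mpp[tiflash]", "mpp[tiflash]"]
    print(uses_tiflash(sample_tasks))
```

If a query that you expect to run on TiFlash shows only `cop[tikv]` tasks, check first that the table has an available TiFlash replica.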
Performance Monitoring
When using TiDB, you can monitor the status and performance metrics of a TiDB cluster by using one of the following methods.
- TiDB Dashboard: check the overall running status of the TiDB cluster, analyze the distribution and trends of read and write traffic, and learn detailed execution information for slow queries.
- Monitoring system (Prometheus and Grafana): view the monitoring metrics of TiDB cluster components, including PD, TiDB, TiKV, TiFlash, TiCDC, and Node_exporter.
To view alert rules for TiDB clusters and TiFlash clusters, see TiDB cluster alert rules and TiFlash alert rules.
Troubleshooting
If problems occur while using TiDB, see the following documents.
- Analyze slow queries
- Identify expensive queries
- Troubleshoot hotspot issues
- TiDB cluster troubleshooting guide
- Troubleshoot TiFlash clusters
You can also create a GitHub issue or submit a question to AskTUG.
What’s Next
- To view the TiFlash version, important logs, and system tables, see Maintain a TiFlash cluster.
- To remove a specific TiFlash node, see Scale in a TiFlash cluster.
Explore HTAP last modified 2022-07-08 11:48:44: tiflash refactor: split use-tiflash into multiple docs (#9452) (#9521)