Metabase ClickHouse



This will launch a Metabase server on port 3000 by default. You can use docker logs -f metabase to follow the rest of the initialization progress. Once Metabase startup completes you can access the app at localhost:3000. Since Docker containers have their own ports and we simply map them to host ports as needed, it is easy to move Metabase onto a different host port if you wish. Metabase works with a wide range of database solutions, including Google Analytics, Google BigQuery, Amazon Athena, ClickHouse, Oracle, and more, and it has one of the simplest installation processes around.

Metabase Cloud plans:

  1. Starter

     The user-friendly cloud BI suite in a box.

     • Unlimited charts
     • Unlimited dashboards
     • Connect to 20+ database types
     • Use 15+ visualizations
     • Schedule updates via email or Slack
     • Fully managed cloud
     • Automated upgrades and backups
     • Out-of-the-box SMTP setup
     • Migrate from open source

     Support: 3-day email
     Deployment: Metabase Cloud
     Price: $85/month, incl. 5 users, then $5/user/month

  2. Growth

     Everything from Starter, plus:

     • Single sign-on via SAML, JWT, or advanced LDAP
     • Soon: choose your cloud region

     Support: 3-day email
     Deployment: Metabase Cloud
     Price: $749/month, incl. 10 users, then $15/user/month

  3. Enterprise

     Everything from Growth, plus:

     • Auditing and compliance tools
     • Row-level permissions
     • Embedded analytics
     • Customize logo, colors, and more
     • Source modification license

     Support: Priority support
     Deployment: Metabase Cloud or self-hosted
     Price: Starts at $15k/year; per-user pricing varies, billed annually

Self-hosted plans:

  1. Open Source

     The user-friendly BI suite in a box.

     • Unlimited charts
     • Unlimited dashboards
     • Connect to 20+ database types
     • Use 15+ visualizations
     • Schedule updates via email or Slack

     Support: Our friendly forum
     Deployment: Self-hosted

  2. Enterprise

     Everything from Growth, plus:

     • Single sign-on (SAML, JWT, advanced LDAP)
     • Auditing and compliance tools
     • Row-level permissions
     • Embedded analytics
     • Customize logo, colors, and more
     • Source modification license

     Support: Priority support
     Deployment: Metabase Cloud or self-hosted
     Price: Starts at $15k/year; per-user pricing varies, billed annually

Without sound, even the best play falls flat; likewise, without visualization, even the best data analysis is incomplete. Data visualization is the "last mile" of big data. Introduction: connecting Metabase to ClickHouse requires a dedicated driver; the current driver version 0.7.3 works with Metabase 0.37.3.


HBase, Kudu, and ClickHouse: a horizontal comparison (V2.0)

Preface

There are many technologies in the Hadoop ecosystem. HDFS has long been used to hold the underlying data and its position is unshakable. HBase, a NoSQL store, is also a core component of the Hadoop ecosystem; its massive storage capacity and excellent random read/write performance cover cases that HDFS cannot handle. ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP). It can generate analytical reports in real time using SQL queries, and it also has excellent data storage capabilities.

Apache Kudu is a distributed storage system released by Cloudera in 2016. Combined with CDH and Impala, it addresses both random reads/writes and SQL-based data analysis, making up for the shortcomings of HDFS (static storage) and HBase (NoSQL) respectively.

Since there are so many technical options available, this article compares HBase, Kudu, and ClickHouse side by side in terms of installation and deployment, architecture, and basic operations. It also describes how several large companies use them in practice, as reference examples.

Comparison of installation and deployment methods

The detailed installation steps are not covered here; below is a brief comparison of the external components each system depends on during installation.

HBase installation

• Depends on HDFS as the underlying storage layer
• Depends on ZooKeeper for metadata storage

Kudu installation

• Depends on Impala as the auxiliary analysis engine
• Depends on a CDH cluster for management, although this is not mandatory; Kudu can also be installed standalone

ClickHouse installation

• Depends on ZooKeeper for metadata storage, log service, and the table catalog service

Architecture comparison

HBase architecture

Kudu architecture

ClickHouse architecture

To sum up, HBase and Kudu both follow a master-slave structure, whereas ClickHouse has no master at all: every ClickHouse server is equivalent, in a multi-master pattern. However, HBase and ClickHouse each add ZooKeeper as an auxiliary metadata store, log service, and so on, while Kudu's metadata is managed by its master; to keep servers from constantly reading metadata from the master, each server fetches a copy of the metadata to hold locally, which carries some risk of metadata loss.

Comparison of basic operations

Data reads and writes

• HBase read path

• HBase write path

• Kudu

• ClickHouse

ClickHouse is an analytical database. In that setting the data is generally immutable, so ClickHouse's support for UPDATE and DELETE is relatively weak; in fact, it does not support the standard UPDATE and DELETE operations at all.

ClickHouse performs updates and deletes through ALTER statements, and it calls these UPDATE/DELETE operations mutations.

Standard SQL updates and deletes are synchronous: the client waits for the server to return the execution result (usually an int). ClickHouse's UPDATE and DELETE, by contrast, run asynchronously: when an UPDATE statement is submitted, the server returns immediately, but the data has not actually changed yet; the mutation is simply queued.

How a mutation works

First, the WHERE condition is used to find the partitions (data parts) that need to be modified; then each affected part is rebuilt and the new part replaces the old one. Once a part has been replaced it cannot be rolled back. For each individual part the operation can be considered atomic, but the mutation as a whole, if it touches multiple parts, is not atomic.

• Updates cannot touch columns that are part of the primary key or the partition key.
• Update operations are not atomic: while a mutation is in progress, a SELECT will likely see some rows already changed and others not, as follows from the process described above.
• Updates are executed in the order they were submitted.
• Once submitted, an update cannot be revoked; even if the ClickHouse service is restarted, it will keep processing mutations in the order recorded in system.mutations.
• Entries for completed mutations are not deleted immediately; the number kept is determined by the finished_mutations_to_keep storage-engine parameter, and older entries are removed once that limit is exceeded.
• Updates can get stuck: an invalid statement such as UPDATE intvalue = 'abc' can never execute, so it blocks the queue forever; in that case KILL MUTATION can be used to cancel it.
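
To make the asynchronous behavior above concrete, here is a minimal sketch using the clickhouse-driver Python package (the table visits and its columns are hypothetical): the ALTER ... UPDATE call returns immediately, system.mutations shows whether the rewrite has actually happened, and a stuck mutation can be cancelled with KILL MUTATION.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")

# Submit a mutation: the server acknowledges immediately; the part rewrite happens later.
# Note: columns in the primary key or partition key cannot be updated.
client.execute("ALTER TABLE visits UPDATE status = 'closed' WHERE user_id = 42")

# Check progress: rows still listed here with is_done = 0 are queued or running.
pending = client.execute(
    "SELECT mutation_id, command, latest_fail_reason "
    "FROM system.mutations WHERE table = 'visits' AND NOT is_done"
)
for mutation_id, command, fail_reason in pending:
    print(mutation_id, command, fail_reason)

# If a malformed mutation is stuck forever, cancel it (mutation_id taken from the query above):
# client.execute("KILL MUTATION WHERE table = 'visits' AND mutation_id = 'mutation_3.txt'")
```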

To sum up, HBase supports random reads and writes, but an HBase update is not a true in-place update: it actually inserts a new piece of data under a different timestamp, and the old version is removed automatically once it expires. ClickHouse, by contrast, offers no real update and delete beyond the asynchronous mutations described above.
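
The versioned-write behavior of HBase can be seen directly from a client. A small sketch using the happybase Python library (the Thrift host, table, and column names are assumptions):

```python
import time
import happybase  # pip install happybase; requires an HBase Thrift server

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("order_status")

row = b"rowkey-1"
now = int(time.time() * 1000)

# Two writes to the same cell: HBase keeps both as separate versions keyed by timestamp.
table.put(row, {b"cf:status": b"created"}, timestamp=now - 1000)
table.put(row, {b"cf:status": b"finished"}, timestamp=now)

# Reading returns the newest version by default; older versions linger until they
# expire (TTL) or are dropped during compaction, per the column family's VERSIONS setting.
for value, ts in table.cells(row, b"cf:status", versions=3, include_timestamp=True):
    print(ts, value)
```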

Data query operations

• HBase: does not support standard SQL; the Phoenix plugin has to be integrated. HBase does have a Scan operation, but running full scans is not recommended, since a full scan can bring the cluster down.

• Kudu: queries are implemented through its integration with Impala.

• ClickHouse: has excellent query performance on its own.

HBase at Didi Chuxing: application scenarios and best practices

At Didi, HBase mainly stores the following four types of data:

• Statistical results and report data: mainly operations, capacity, revenue, and so on, usually queried through SQL with the help of Phoenix. The data volume is small, the demand for query flexibility is high, and the latency requirements are moderate.
• Raw fact data: orders, driver and passenger GPS trajectories, logs, and so on, mainly used to feed online and offline data pipelines. The data volume is large, the consistency and availability requirements are high, latency is sensitive, writes happen in real time, and reads are single-point or batch queries.
• Intermediate result data: the data needed for model training. The data volume is large, the availability and consistency requirements are moderate, and batch queries need high throughput.
• Backups of online systems: users keep the original data in other relational databases or file services and use HBase as a remote disaster-recovery copy.

Order events

Queries for recent orders land on Redis; beyond a certain time window, or when Redis is unavailable, the query falls through to HBase. The business requirements are as follows:

• Online queries of the order lifecycle status, including status, event_type, order_detail, and so on. The main queries come from the customer-service system.
• Online queries of historical order details. Redis stores recent orders; when Redis is unavailable or the query range goes beyond what Redis holds, the query goes directly to HBase.
• Offline analysis of order status.
• Writes must sustain 10K events per second, reads 1K events per second, and data must be available within 5s.

Based on these requirements we designed the rowkeys as follows; they are very typical scan scenarios.

• Order status table

Rowkey: reverse(order_id) + (MAX_LONG - TS)

Columns: the various states of the order

• Order history table

Rowkey: reverse(passenger_id | driver_id) + (MAX_LONG - TS)

Columns: the user's orders and related information within the time range
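
A small sketch of how rowkeys of this shape might be assembled (plain Python; the helper names and the fixed-width timestamp encoding are assumptions, not Didi's actual code):

```python
import struct

MAX_LONG = 2**63 - 1  # Java's Long.MAX_VALUE

def order_status_rowkey(order_id: str, ts_millis: int) -> bytes:
    """reverse(order_id) + (MAX_LONG - TS): reversing the id spreads sequential
    orders across regions, and (MAX_LONG - TS) makes newer events sort first."""
    return order_id[::-1].encode() + struct.pack(">q", MAX_LONG - ts_millis)

def order_history_rowkey(user_id: str, ts_millis: int) -> bytes:
    """reverse(passenger_id | driver_id) + (MAX_LONG - TS)."""
    return user_id[::-1].encode() + struct.pack(">q", MAX_LONG - ts_millis)

# A query for one user's recent orders becomes a prefix scan on reverse(user_id).
print(order_status_rowkey("2023988123", 1613779200000).hex())
```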

Driver and passenger trajectories

A few example usage scenarios: when a user views a historical order, the map shows the route that was taken; when there is a dispute between driver and passenger, customer service pulls up the order trajectory to reconstruct what happened; map teams analyze road congestion from user trajectories.

The requirements:

• Support real-time or near-real-time trajectory coordinate queries from the app, by users or by back-end analysts.
• Support offline, large-scale trajectory analysis.
• Given a specified geographical range, return the trajectories of all users inside that range, or of all users who have appeared in it.

Regarding the third requirement, geographic range queries: MongoDB supports this kind of geographic index natively, but at Didi's scale it might hit a storage bottleneck, whereas HBase has no storage or scalability pressure but also has no built-in geographic index like MongoDB's, so we had to build one ourselves. After some research we settled on the fairly general GeoHash algorithm: the GeoHash, together with the other dimensions that need indexing, forms the rowkey, and the actual GPS points are the value. On top of this we packaged a client and heavily optimized the query logic and query strategy inside it for speed, which effectively turns HBase into a database with geographic indexing like MongoDB. If the query range is very large (for example, province-level analysis), an additional MapReduce job is used to read the data. The rowkeys for the two query scenarios are designed as follows (a sketch follows the list):
• A single user queried by order or by time range: reverse(user_id) + (MAX_LONG - TS/1000)
• Trajectory queries within a given range: reverse(geohash) + ts/1000 + user_id
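
A sketch of the two trajectory rowkeys following the formulas above; the pygeohash package is an assumption (any GeoHash implementation works) and the helper names are hypothetical:

```python
import pygeohash  # pip install pygeohash

MAX_LONG = 2**63 - 1

def user_track_rowkey(user_id: str, ts_millis: int) -> str:
    # reverse(user_id) + (MAX_LONG - TS/1000): one user's points, newest first
    return user_id[::-1] + str(MAX_LONG - ts_millis // 1000)

def range_track_rowkey(lat: float, lon: float, ts_millis: int, user_id: str) -> str:
    # reverse(geohash) + ts/1000 + user_id, as in the article; a range query
    # enumerates the GeoHash cells covering the area and scans each cell's keys
    gh = pygeohash.encode(lat, lon, precision=8)
    return gh[::-1] + str(ts_millis // 1000) + user_id

print(range_track_rowkey(31.2304, 121.4737, 1613779200000, "driver_42"))
```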

ETA

ETA refers to the estimated time of arrival and price shown each time a user selects a pickup point and a destination. The original version was computed offline; a later revision made it real time by using HBase as a key-value cache, which brought shorter training times, the ability to train many cities in parallel, and less manual intervention. The overall ETA process is as follows:

1. Models are trained through a Spark job, once every 30 minutes per city.
2. In the first stage of training, all of the city's data matching the configured conditions is read from HBase within 5 minutes.
3. In the second stage, the ETA computation is completed within 25 minutes.
4. The data in HBase is persisted to HDFS at regular intervals, for testing new models and extracting new features.

Rowkey: salting + cityid + type0 + type1 + type2 + TS; Columns: order, feature
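
The salting prefix is what keeps a single city's burst of writes from hammering one region; a minimal sketch (the field names follow the formula above, while the hashing scheme and bucket count are assumptions):

```python
import hashlib

NUM_SALT_BUCKETS = 16  # assumption: roughly one bucket per pre-split region

def eta_rowkey(city_id: str, type0: str, type1: str, type2: str, ts_millis: int) -> bytes:
    body = f"{city_id}|{type0}|{type1}|{type2}|{ts_millis}"
    # A stable hash of the body, modulo the bucket count, is prepended so that
    # writes for one city/time window spread across NUM_SALT_BUCKETS regions.
    salt = int(hashlib.md5(body.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{salt:02d}|{body}".encode()

print(eta_rowkey("010", "peak", "rain", "suv", 1613779200000))
```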

Monitoring tool DCM

DCM is used to monitor Hadoop cluster resource usage (NameNode, YARN container usage, and so on). A relational database develops all sorts of performance problems once the time dimension grows, and we also wanted to be able to analyze and query the data with SQL, so we use Phoenix on HBase: a collector program ingests data at regular intervals, reports are generated and stored in HBase, query results come back at second-level latency, and the results are then displayed in the frontend.

Summary

In promoting and practicing HBase at Didi, we believe the two most important things are helping users design good table structures and controlling resources. With these two prerequisites in place, the probability of later problems drops greatly. Good table design requires users to understand clearly how HBase is implemented, but most business users focus on business logic and know little about the underlying architecture, so platform administrators have to help and guide them continuously; after a good start and a few successful cases, those users then help promote the platform to other teams. Resource isolation and control help us effectively reduce the number of clusters and lower maintenance costs, freeing platform administrators from endless multi-cluster management so they can put more energy into following the component communities and developing the platform management system. This puts the business and the platform into a virtuous cycle, improves the user experience, and better supports the growth of the company's business.

NetEase Kaola: building a real-time traffic data warehouse on Kudu

Kudu provides not only row-level insert, update, and delete APIs but also batch scans with Parquet-like performance. With a single copy of the data it can serve both random reads and writes and data-analysis workloads.

Writing real-time traffic and business data

You can use the KafkaUtils.createDirectStream method provided by Spark Streaming to create a DStream for the corresponding topic. With this method, the topic's offsets are entirely under our own control.

Writing into Kudu takes the following steps:

• Obtain the Kafka offsets
• Create a KuduContext
• Define the schema of the Kudu table to write to
• Parse the traffic logs according to the parsing logic and build a DataFrame
• Insert the DataFrame into Kudu and commit the offsets
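
The article's pipeline is written in Scala with KuduContext; the following is only a rough PySpark approximation of the same steps, using Structured Streaming's foreachBatch and the kudu-spark DataSource (package version, Kafka topic, schema, and parsing logic are all assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructType

# Requires the kudu-spark package on the classpath, e.g.
# spark-submit --packages org.apache.kudu:kudu-spark3_2.12:1.15.0 ...
spark = SparkSession.builder.appName("flow-log-to-kudu").getOrCreate()

schema = StructType().add("user_id", StringType()).add("page", StringType()).add("ts", LongType())

# 1. Read the topic; Structured Streaming tracks the Kafka offsets in its checkpoint.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "flow_log")
       .load())

# 2. Parse the traffic log into a DataFrame matching the Kudu table's schema.
parsed = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# 3. Append each micro-batch to Kudu; offsets are committed together with the checkpoint.
def write_batch(df, epoch_id):
    (df.write.format("org.apache.kudu.spark.kudu")
       .option("kudu.master", "kudu-master:7051")
       .option("kudu.table", "haitao_dev_log.dwd_kl_flw_app_rt")
       .mode("append")
       .save())

(parsed.writeStream.foreachBatch(write_batch)
 .option("checkpointLocation", "/tmp/flow_log_ckpt")
 .trigger(processingTime="15 seconds")
 .start()
 .awaitTermination())
```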

Write performance test

Kudu table being written: haitao_dev_log.dwd_kl_flw_app_rt, with 240 partitions; Spark Streaming scheduling interval: 15s.

You can see that 75% of the tasks finish within 1s; a few tasks are slower, but overall everything completes within 2s. One extreme case is worth noting: Spark's default number of concurrent jobs is 1, so if one batch has not finished (usually because the data is unevenly distributed), the next batch queues up and is not scheduled until the first one completes. That wastes resources, with several idle executors waiting on the last busy one. Traffic logs do not need to be inserted in order, so the number of concurrent jobs can be increased with the appropriate parameter settings (for example, spark.streaming.concurrentJobs).

Summary

Today the traffic logs written to Kudu in real time amount to billions of rows per day, at the TB level, and business use cases such as real-time traffic attribution already depend on the underlying traffic data in Kudu. More lines of business will be moved onto Kudu to meet analysis needs along different dimensions.

Ctrip: ClickHouse log analysis in practice

Looking at Ctrip's log-analysis scenario: logs entering ES have already been formatted as JSON, and logs of the same type share a unified schema, which matches ClickHouse's table model. Log queries usually count quantities, totals, averages, and so on along some dimension, which matches the use case for ClickHouse's columnar storage. Occasionally a few scenarios need fuzzy matching on strings; once a large amount of data has been filtered down by other conditions, fuzzy matching on the small remainder is something ClickHouse also handles well. In addition, we found that more than 90% of the logs never use ES's full-text indexing features, so we decided to try handling logs with ClickHouse.

Consuming data into ClickHouse

We use gohangout to consume data into ClickHouse. Some suggestions on writing data:

• Write by polling across all servers in the ClickHouse cluster, so the data ends up roughly evenly distributed.
• Write in large batches at low frequency to reduce the number of parts and the amount of server-side merging, and to avoid Too many parts exceptions. We use two thresholds to control write volume and frequency: a batch is written once it exceeds 100,000 records or every 30s, whichever comes first.
• Write to local tables, not to distributed tables: a distributed table splits the data into many parts and forwards them to other servers, which increases inter-server network traffic and the servers' merge workload, slows down writes, and makes Too many parts more likely.
• Think about the partition key when creating tables. We have seen people set the partition key to the timestamp, which makes inserts constantly throw Too many parts; we usually partition by day.
• Primary key and index settings, out-of-order data, and so on can also slow writes down.
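
A stripped-down sketch of these write rules (the article uses gohangout; this is just a hypothetical Python equivalent built on clickhouse-driver): round-robin over the nodes' local tables and flush only when a 100,000-row batch or a 30-second window is reached.

```python
import itertools
import time
from clickhouse_driver import Client  # pip install clickhouse-driver

# Hypothetical node list: write to each node's *local* table, never the distributed table.
nodes = itertools.cycle([Client(host=h) for h in ("ch1", "ch2", "ch3")])

BATCH_ROWS = 100_000
FLUSH_SECONDS = 30

buffer = []
last_flush = time.time()

def flush():
    global buffer, last_flush
    if buffer:
        # One big INSERT creates one part per flush instead of thousands of tiny ones.
        next(nodes).execute("INSERT INTO logs_local (ts, level, message) VALUES", buffer)
    buffer, last_flush = [], time.time()

def add(row):
    buffer.append(row)
    if len(buffer) >= BATCH_ROWS or time.time() - last_flush >= FLUSH_SECONDS:
        flush()
```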

Data presentation

We investigated tools such as Superset, Metabase, and Grafana, but in the end decided to build ClickHouse support on top of Kibana3 for chart display. The main reason is Kibana3's powerful data-filtering capability, which many systems lack; we also considered that migrating to another system would be costly and hard for users to adapt to in the short term.

Query optimization

The Table panel in Kibana is used to show detailed log records: a typical query asks for all fields over the last hour, yet in the end only the first 500 records are displayed. This access pattern is very unfriendly to ClickHouse.

To address this, we split the Table panel query into two passes: the first pass queries the data volume per unit time interval and, based on how many rows will actually be displayed, computes a reasonable query time range; the second pass then queries the details over that corrected time range, restricted to the columns the Table panel displays by default.
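
A sketch of the two-pass query (table and column names are made up; the real logic lives inside the Kibana3 ClickHouse adapter):

```python
from clickhouse_driver import Client

client = Client(host="ch1")
LIMIT = 500  # rows the Table panel will actually display

# Pass 1: row counts per minute over the last hour, newest first.
counts = client.execute(
    "SELECT toStartOfMinute(ts) AS m, count() AS c "
    "FROM logs_local WHERE ts >= now() - INTERVAL 1 HOUR "
    "GROUP BY m ORDER BY m DESC"
)

# Walk backwards from now until the buckets cover ~500 rows, so pass 2 only
# scans the time range it really needs.
total, start = 0, None
for minute, c in counts:
    total += c
    start = minute
    if total >= LIMIT:
        break

# Pass 2: fetch only the default display columns over the narrowed range.
if start is not None:
    details = client.execute(
        "SELECT ts, level, message FROM logs_local "
        "WHERE ts >= %(start)s ORDER BY ts DESC LIMIT %(limit)s",
        {"start": start, "limit": LIMIT},
    )
```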


After these optimizations, query time drops to 1/60 of the original, the number of queried columns is reduced by 50%, and the amount of data scanned per query drops to 1/120 of the original. ClickHouse also provides a variety of approximate-calculation functions that keep accuracy relatively high while reducing the amount of computation, and using a MATERIALIZED VIEW or a MATERIALIZED column to move computation to write time can likewise effectively reduce the amount of data scanned and computed at query time.
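
As an example of moving computation to write time, a per-minute pre-aggregation can be maintained with a materialized view (a sketch; logs_local and its columns are assumptions):

```python
from clickhouse_driver import Client

client = Client(host="ch1")

# Rows inserted into logs_local are aggregated into this view at write time, so
# dashboard count queries read the small pre-aggregated table instead of raw logs.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS logs_per_minute
    ENGINE = SummingMergeTree
    ORDER BY (level, minute)
    AS SELECT level, toStartOfMinute(ts) AS minute, count() AS cnt
    FROM logs_local GROUP BY level, minute
""")

# Query side: sum(cnt), because SummingMergeTree collapses rows only on merges.
rows = client.execute(
    "SELECT minute, sum(cnt) FROM logs_per_minute "
    "WHERE level = 'ERROR' GROUP BY minute ORDER BY minute"
)
```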

ClickHouse day-to-day operations

Overall, operating ClickHouse is simpler than operating ES. The work mainly covers the following areas:

• Onboarding new logs and performance tuning.
• Cleaning up expired logs: a scheduled task drops the expired partitions every day (a sketch appears at the end of this section).
• Monitoring ClickHouse, implemented with ClickHouse-exporter + VictoriaMetrics + Grafana.
• Data migration: thanks to ClickHouse's distributed tables, we usually do not move historical data; new data is simply written to the new cluster, and distributed tables are used to query across clusters. As time passes the historical data is cleaned up and taken offline, and once all of the old cluster's data is offline the migration from the old to the new cluster is complete. When data really does need to be moved, we use clickhouse-copier or copy it manually.
• Handling common problems:

Slow queries: terminate a slow query with KILL QUERY, then optimize it using the approaches described above.

Too many parts exceptions: a Too many parts exception means parts are being created faster than they can be merged, i.e. the merge speed cannot keep up with the rate at which parts are produced. There are several causes:

• Unreasonable partition settings
• Small, high-frequency batches written to ClickHouse
• Writing to ClickHouse distributed tables
• Too few merge threads configured in ClickHouse

• Failure to start: we have run into ClickHouse failing to start before; this falls into two categories:
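
The daily expired-log cleanup mentioned in the maintenance list above comes down to dropping old partitions; a minimal sketch, assuming the log tables are partitioned by day with toYYYYMMDD(ts) (database and table names are hypothetical):

```python
from datetime import date, timedelta
from clickhouse_driver import Client

RETENTION_DAYS = 30
client = Client(host="ch1")

# With PARTITION BY toYYYYMMDD(ts), partition ids are YYYYMMDD values,
# so string comparison against a cutoff date works.
cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).strftime("%Y%m%d")

partitions = client.execute(
    "SELECT DISTINCT partition FROM system.parts "
    "WHERE database = 'logs' AND table = 'logs_local' AND active"
)
for (partition,) in partitions:
    if partition < cutoff:
        client.execute(f"ALTER TABLE logs.logs_local DROP PARTITION {partition}")
```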

Summary

Migrating logs from ES to ClickHouse saves server resources, lowers the overall operation and maintenance cost, and improves query speed. Especially when users are troubleshooting urgent issues, this kind of query speedup brings a noticeable improvement in user experience.

However, ClickHouse is not ES, and in many business scenarios ES remains irreplaceable. ClickHouse is also not only for logs; studying ClickHouse further and letting it deliver greater value in more areas is what we keep working on.

Summary

Here is a brief summary of the characteristics of the three systems.


First, HBase versus Kudu: it is fair to say that Kudu takes after HBase, and their architectures are a similar master-slave structure. HBase's physical model consists of a master and region servers. A region server stores regions; a region contains many stores, one store corresponding to one column family; and a store contains one memstore and multiple storefiles. A store is ultimately backed by HFiles, which are Hadoop binary files. HFile and HLog are HBase's two file storage formats: HFile stores the data, while HLog (the write-ahead log) guarantees that data can be safely written into HFiles.

Kudu's physical model consists of a master and tablet servers (tservers). A table is partitioned by hash and range into a number of tablets stored on the tablet servers. Each tablet has a leader and followers: the leader handles write requests, while followers serve read requests. In short, one tablet server can serve multiple tablets, and one tablet can be served by multiple tablet servers (based on the tablet's replication, with a minimum of two replicas).

ClickHouse's defining characteristic is what it claims to be the fastest query performance. It can store data as well, but that is not its strength, and ClickHouse also cannot update or delete data in the standard sense.

Finally, HBase, Kudu, and ClickHouse can be compared along the following dimensions.

So HBase is better suited to unstructured data storage; in scenarios that need both random reads/writes and real-time updates, Kudu + Impala is fully up to the job, and it is even better when combined with CDH, since the bottleneck is not Kudu itself but the Apache deployment of Impala. If all you need is high-speed querying of static data, ClickHouse is the better choice.

Reference

[1] https://4m.cn/xKqzL, author: super_chenzhou


Copyright notice:

This article is published by the Big Data Technology and Architecture WeChat official account (import_bigdata) with the original author's exclusive authorization. Reprinting without the original author's permission will be treated as infringement.

Editor | Cold Eye
