Trends in database technology over the past ten years
Looking back over the past few years, many new things have appeared in the field of distributed systems. The rise of cloud and AI in particular has pushed this once-unfashionable field to the forefront. Many new technologies and ideas were born during this period, and this old field has come back to life.
Storage is an important topic in every era, so let's talk about databases today. Over the past few years, several clear trends have emerged in database technology.
Further separation of storage and computing
My first encounter with storage-compute separation was Snowflake. The Snowflake team's 2016 paper, "The Snowflake Elastic Data Warehouse", is one of the best big data papers I have read in recent years, and I recommend it highly. The key point of Snowflake's architecture is the combination of stateless compute nodes, an intermediate cache layer, and S3 as the place where the data actually lives. Compute is not strongly coupled to the cache layer, which fits the way the cloud works. Judging from the hot/cold separation architecture that AWS recently introduced for Redshift, AWS also acknowledges that Snowflake's approach is the right direction.
Anyone who has followed databases in recent years will also have noticed Aurora. Unlike Snowflake, Aurora is probably the first product to apply storage-compute separation to an OLTP database and succeed with it. Aurora's success lies in moving the granularity of data replication from the Binlog down to the Redo Log, which greatly reduces IO amplification on the replication link. In addition, the front end reuses MySQL, which gives essentially 100% application-level MySQL syntax compatibility, and operations are fully managed, which further widens the range of applications that traditional MySQL can serve. For small and medium data volumes, this is a very low-effort solution.
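The Snowflake-style layout described above (stateless compute, a cache in the middle, durable data in object storage) can be sketched in a few lines. This is only a toy model: `ObjectStore` and `ComputeNode` are hypothetical names, the "S3" here is an in-memory dictionary, and nothing below reflects Snowflake's real implementation.

```python
from typing import Dict


class ObjectStore:
    """Hypothetical in-memory stand-in for durable shared storage such as S3."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.reads = 0  # how many times compute had to go to remote storage

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        self.reads += 1
        return self._blobs[key]


class ComputeNode:
    """Stateless compute node: the local cache is only an optimization."""

    def __init__(self, store: ObjectStore) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        # Read-through cache: fetch from the object store only on a miss.
        if key not in self._cache:
            self._cache[key] = self._store.get(key)
        return self._cache[key]


store = ObjectStore()
store.put("table/part-0", b"rows...")

node_a = ComputeNode(store)
node_a.read("table/part-0")  # cache miss, goes to the object store
node_a.read("table/part-0")  # cache hit, no remote read

# Throwing node_a away loses nothing durable; a fresh node serves the same data.
node_b = ComputeNode(store)
result = node_b.read("table/part-0")
```

Because the cache holds no authoritative state, compute nodes can be added or removed freely; the only cost of a new node is a cold cache.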
Although Aurora has achieved commercial success, I don't think it contains much technical innovation. Anyone familiar with Oracle will probably find Aurora's architecture familiar at first sight: Oracle used a similar approach about a decade ago, and even solved the cache coherence problem well. Aurora's Multi-Master also still has a long way to go. According to recent statements at re:Invent, the main use case for Aurora Multi-Master is still as a high-availability solution for a single writer. The underlying reason is probably that the current multi-writer implementation uses optimistic conflict detection at page granularity, which causes a large performance drop in workloads with a high conflict rate.
I think Aurora is a good solution for 90% of public cloud Internet users: 100% MySQL compatible, not too concerned about consistency, far more reads than writes, and fully managed. At the same time, Aurora's architecture means it gives up the 10% of users with extreme needs, such as global ACID transactions with strong consistency, hyper scale (above 100 TB, where the business cannot easily be split up), and real-time complex OLAP. For these needs, I think a shared-nothing design similar to TiDB's is the only way out. As a distributed systems engineer, I find any architecture that cannot scale horizontally less than elegant.
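To see why page-level optimistic conflict detection hurts under contention, here is a minimal sketch. It is a toy model, not Aurora's implementation: `PageStore`, `Txn`, and `ROWS_PER_PAGE` are hypothetical names, and the validation rule (abort if any touched page changed since it was first seen) is one simple form of optimistic concurrency control.

```python
from typing import Dict

ROWS_PER_PAGE = 100  # assumed page size, purely illustrative


def page_of(row_id: int) -> int:
    """Map a row to the page that holds it."""
    return row_id // ROWS_PER_PAGE


class PageStore:
    """Tracks one version counter per page and validates at commit time."""

    def __init__(self) -> None:
        self._page_version: Dict[int, int] = {}

    def begin(self) -> "Txn":
        return Txn(self)

    def version(self, page: int) -> int:
        return self._page_version.get(page, 0)

    def commit(self, txn: "Txn") -> bool:
        # Validation phase: abort if any touched page changed in the meantime.
        for page, seen in txn.seen.items():
            if self.version(page) != seen:
                return False
        # Write phase: bump the version of every page this transaction wrote.
        for page in txn.seen:
            self._page_version[page] = self.version(page) + 1
        return True


class Txn:
    def __init__(self, store: PageStore) -> None:
        self._store = store
        self.seen: Dict[int, int] = {}  # page -> version when first touched

    def write(self, row_id: int) -> None:
        page = page_of(row_id)
        self.seen.setdefault(page, self._store.version(page))


store = PageStore()
t1, t2, t3 = store.begin(), store.begin(), store.begin()
t1.write(1)    # row 1 lives on page 0
t2.write(2)    # row 2 is a different row, but also on page 0
t3.write(150)  # row 150 lives on page 1

r1 = store.commit(t1)  # succeeds
r2 = store.commit(t2)  # aborts: page 0 changed, although the rows differ
r3 = store.commit(t3)  # succeeds: page 1 was untouched by the others
```

The second transaction never touched the same row as the first, yet it must abort and retry. The coarser the detection unit, the more of these false conflicts a hot workload produces.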
ACID returns to the stage with distributed SQL databases
Recall that a few years ago, when NoSQL was at its peak, everyone wanted to rebuild every system on NoSQL. Ease of use, scalability, and performance were good, but most NoSQL systems abandoned some of the most important things in a database, such as ACID guarantees and SQL. The main promoters of NoSQL were Internet companies. With relatively simple business logic and strong engineering teams, they could easily make up for what NoSQL lacks with their own tooling. But in recent years, people have gradually found that the low-hanging fruit is mostly gone, and what remains are the hard problems.
The best example is that Google, a pioneer of NoSQL, was also the first to build NewSQL systems (Spanner and F1). In the post-mobile era, services are becoming more complex and more real-time, and the demands placed on data keep growing. Financial institutions in particular face the Internet with their products on the one hand, and on the other hand cannot get around ACID, whether because of regulatory requirements or business needs. More practically, most traditional companies do not have the talent pool of top Internet companies. A large number of legacy systems are built on SQL, and migrating them entirely to NoSQL is simply not realistic.
Against this background, the distributed relational database, which I think is the last missing piece in the open source database market for our generation, has finally started to become popular. I will not go into the many details here for reasons of space. I recommend the article "From Big Data to Database" by maxiaoyu, Technical Leader of PingCAP TiFlash, which gives an excellent account of this topic.
Further integration of cloud infrastructure and databases
For the past few decades, database developers seem to have worked in isolation, treating the operating system and everything beneath it as a black box. That assumption was reasonable; after all, most software developers have no hardware background. In addition, a solution that is tied too closely to the hardware and the underlying infrastructure is unlikely to become a de facto standard, and hardware is hard to debug and update and costs too much, which is why I have never been very interested in custom all-in-one appliances. However, the emergence of the cloud has turned the basic capabilities of IaaS into reusable software units. I can rent computing power and services on demand in the cloud, which gives database developers more possibilities when designing systems. To give a few examples:
1. Spanner's native TrueTime API relies on atomic clocks and GPS clocks. A pure software implementation has to sacrifice a lot (for example, CockroachDB's HLC and TiDB's improved Percolator model are both transaction models based on software clocks). In the long run, though, both AWS and GCP will provide TrueTime-like high-precision clock services, so that we can better achieve low-latency distributed transactions over long distances.
2. With lightweight containers and managed Kubernetes services such as Fargate + EKS, a database can cope with read hotspots on small tables (this scenario is almost the biggest weakness of the shared-nothing architecture). For example, TiDB can use the Raft Learner mechanism together with a cloud auto scaler to quickly create read-only replicas in new containers, instead of serving reads only from the 3 replicas: dynamically start 10 pods, create Raft replicas of the hot data in them (this is an important reason why we designed TiKV's data shards to be so small), and once the burst of read traffic has been handled, destroy the containers and go back to 3 replicas.
3. Separation of hot and cold data. This one is easy to understand: infrequently accessed data shards, analytical replicas, and data backups are placed on S3, which greatly reduces costs.
4. RDMA / CPU / supercomputing as a service. Any hardware-level improvement in the cloud, as long as it is exposed through an API, can bring new benefits to software developers.
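As background for the software-clock approach mentioned in the first example, here is a minimal sketch of a hybrid logical clock (HLC) following the published HLC algorithm. It is not CockroachDB's code: the class name, method names, and the injectable `physical_now` function are assumptions made so the example is deterministic.

```python
from typing import Callable, Tuple

# (wall-clock part, logical counter); tuples compare lexicographically.
Timestamp = Tuple[int, int]


class HybridLogicalClock:
    """Hybrid logical clock driven by an injectable physical clock."""

    def __init__(self, physical_now: Callable[[], int]) -> None:
        self._now = physical_now
        self._l = 0  # largest wall-clock value seen so far
        self._c = 0  # counter that orders events sharing the same _l

    def tick(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        prev = self._l
        self._l = max(prev, self._now())
        self._c = self._c + 1 if self._l == prev else 0
        return (self._l, self._c)

    def observe(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        r_l, r_c = remote
        prev = self._l
        self._l = max(prev, r_l, self._now())
        if self._l == prev and self._l == r_l:
            self._c = max(self._c, r_c) + 1
        elif self._l == prev:
            self._c += 1
        elif self._l == r_l:
            self._c = r_c + 1
        else:
            self._c = 0
        return (self._l, self._c)


# Node B's physical clock lags behind node A's by 10 units.
node_a = HybridLogicalClock(lambda: 100)
node_b = HybridLogicalClock(lambda: 90)

sent = node_a.tick()             # A stamps a message
received = node_b.observe(sent)  # B receives it despite its slower clock
later = node_b.tick()            # a later local event on B
```

Even though node B's physical clock is behind, the timestamps still respect causality (`sent < received < later`). What a software clock cannot give is a tight bound on real-time uncertainty, which is the part TrueTime gets from hardware.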
There are many more examples, so I won't list them one by one. In short, my point is that cloud service APIs will become like the standard library was for code in the past: something everyone can rely on. Although public cloud SLAs are still not ideal, in the long run they will keep improving.
So, where is the future of the database? Will it be more specialized or more unified? On this question, I agree that there is no silver bullet in this world, but I am not as pessimistic as my idol, AWS CTO Dr. Vogels, who believes that the future is a fragmented world (AWS seems eager to design a database for every niche scenario). Excessive fragmentation increases the cost of moving data between different systems. There are two keys to solving this problem:
1. At what granularity should data products be divided?
2. Can users be spared from knowing what goes on behind the scenes?
The first question does not have a clear answer, but I think finer is definitely not always better, and the answer depends on the workload. For example, if the data volume is not that large, running an analytical query directly on MySQL or PostgreSQL is no problem at all, and there is no need to use Redshift. Although there is no direct answer, I have a vague sense that the first question and the second question are closely related. There is no silver bullet: OLAP will always run faster on a column storage engine than on a row storage engine. But from the user's point of view, both can sit behind the same SQL interface.
SQL is a very good language. It only describes the user's intent and is completely independent of the implementation. A database can therefore be specialized behind the SQL layer. The introduction of TiFlash in TiDB is a good example. The motivation is simple:
1. Users are not database experts. You can't expect users to pick the right database 100% of the time and use it correctly.
2. Only within a single system can data synchronization preserve as much information as possible. For example, TiFlash can keep the MVCC versions of transactions in TiDB, and its data synchronization granularity can be as fine as the Raft Log level. In addition, new capabilities such as full-text search can still be exposed through the SQL interface, and SQL can express them concisely. I won't go through them one by one here.
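The idea of one interface in front of differently shaped storage engines can be illustrated with a toy example. Everything here is hypothetical (`RowStore`, `ColumnStore`, `Database`, and the fixed routing rule); it is not how TiDB or TiFlash choose an engine, and it ignores how the two copies are kept in sync.

```python
from typing import Dict, List, Optional

Row = Dict[str, int]


class RowStore:
    """Row layout: each record is stored together, good for point lookups."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = [dict(r) for r in rows]

    def point_get(self, key_col: str, key: int) -> Optional[Row]:
        for row in self.rows:
            if row[key_col] == key:
                return row
        return None


class ColumnStore:
    """Column layout: each column is stored together, good for aggregates."""

    def __init__(self, rows: List[Row]) -> None:
        names = rows[0].keys() if rows else []
        self.columns = {name: [r[name] for r in rows] for name in names}

    def sum(self, col: str) -> int:
        # Only the one requested column is scanned.
        return sum(self.columns[col])


class Database:
    """Single front end; the caller never chooses an engine."""

    def __init__(self, rows: List[Row]) -> None:
        self._row_store = RowStore(rows)
        self._column_store = ColumnStore(rows)

    def get(self, key_col: str, key: int) -> Optional[Row]:
        return self._row_store.point_get(key_col, key)  # OLTP-style access

    def total(self, col: str) -> int:
        return self._column_store.sum(col)  # OLAP-style access


db = Database([
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
])
order = db.get("id", 2)
revenue = db.total("amount")
```

The caller asks for a record or a total and never names a storage engine, which is the property the SQL layer provides in a real system.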
I firmly believe that systems must evolve toward being smarter and easier to use. This is the 21st century. Would you rather carry a Nokia and a camera around every day, or just one smartphone?
I keep coming back to one question: why do we need distributed systems at all? To a large extent, they may be a last resort. If Moore's Law had not failed, and if the growing computing and storage needs of the Internet could be met with low-cost hardware, would we still need them?
The past two or three decades have been a vast revolution in which the software world rescued itself. The development of distributed technology has profoundly changed how we program and how we think about software. Out of x86 and Arm machines that can be found everywhere, it has built computing and storage capacity that can expand without limit, which is the most romantic act of self-salvation for software engineers.