On the Internet we talk only about PostgreSQL and MySQL, but the choice of DBMS is much wider. In this material we will analyze several popular databases, find out their characteristics and use cases to overcome common solutions.
"Databases are easy," they said
It seems that five seconds is enough to deploy any database. Install Docker, enter the command:
docker run postgres
That's right, you're awesome! In reality, every administrator still had a lot of work to do on this task. First, you had to find the right hardware for the DBMS. Set up your operating system and regular updates. Organize a monitoring and backup program. Fix replication problems and put together a cluster. To save time, companies often choose off-the-shelf solutions from vendors who have already performed these tasks for each product.
Two main factors influence the choice of a DBMS for a project:
The team has experience with a specific solution and sees no point in diving into uncharted waters.
If this is not the first database in the project, a new one is selected taking into account the implemented infrastructure. First of all, the configuration of operating systems and equipment from which data is received/sent is taken into account. They take into account the analytical data collection capabilities and the capabilities of other databases. In addition, attention must be paid to how resources are allocated and their consumption is controlled.
Let's take a look at some popular DBMS systems that work independently or can be great partners for jointly implementing more complex and faster solutions.
Features of PostgreSQL
The database is object-relational and is often used as the main database in projects of various sizes. In brief, data handling can be described as follows: data is stored in tables, indexed for fast searching, and covered with complex dependencies. Indexes help you organize data according to book index logic, so you don't have to spend time searching the entire database. Dependencies are necessary to implement a flexible system of triggers for writing and reading data. For example, a bank employee can use them to retrieve information about all customers who were given mortgages on Monday.
Reliability and replication
PostgreSQL handles complex queries, large workloads, and writes important data to disk. To prevent data loss due to server failure, a transaction log (WAL) is first recorded. In case of an emergency, you will be able to recover. The WAL is also involved in both types of streaming replication:
Asynchronous replication. PostgreSQL will first apply the changes to the master node and then send the WAL records to the replicas. The advantage of this method is that the transaction is committed quickly because you don't have to wait for all replicas to apply the changes.
Synchronous replication. Changes are first written to the WAL of one or more replicas and then sent to the master server. The advantage is that this is a more reliable method and makes data loss more difficult.
There is also logical replication. This type works with records in PostgreSQL tables. This is different from streaming replication, which works at the level of physical data: bits, bytes, and block addresses on disk. One server publishes changes, another subscribes to them. You can subscribe to changes selectively. For example, the primary server has 50 tables: 25 can be copied to one replica, and 25 can be copied to another one.
Versatility: modules and extensions
The solution is suitable for use in any operating system: Linux, macOS, Windows. Most frameworks (Ruby on Rails, Yii, Symfony, Django) support the use of PostgreSQL in programming.
Thanks to modules and plugins, PostgreSQL adapts to a wide variety of tasks. For example, the pgcrypto module provides cryptographic functions and hstore is a data type for storing key-value pairs to achieve quasi-Redis or MongoDB.
Thanks to PostgreSQL extensions, PostgreSQL can be integrated with systems that previously did not understand DBMS commands or were slow and incorrect. This is how large-scale extensions for 1C, GIS, and IoT have emerged. We will discuss some of these solutions below; they deserve special attention.
Working with PostgreSQL, you can create your own data types. You need them to simplify working with the database or to set limits. For example, there is a device that displays only integers from 1 to 5. You need to store the data of the device in the database. You can create your own data type consisting only of the numbers 1, 2, 3, 4, 5. Then entering other values will be an error and therefore will not corrupt the data.
It sounds like magic, but PostgreSQL can even work with NoSQL, JSON and semi-structured data. Since 2012, there is a JSON data type for this purpose, which stores only valid JSON. Validation is performed before converting to this type, and any errors will be flagged:
--ERROR: invalid input syntax for type json
What PostgreSQL is used for in real conditions
Secure Transactions. PostgreSQL provides multi-layered security that includes support for data encryption, access control, and transaction auditing. PostgreSQL also provides protection against SQL injection, blocks access to system tables and functions, and provides other data security tools. PostgreSQL supports ACID requirements, so it is actively used in financial and banking environments.
Process and store large amounts of data. A popular task of PostgreSQL is integration into the architecture of multilingual sites with a large number of stores in different countries. The database allows fast synchronization of product information from points and warehouses, so that product lists are always up-to-date.
Data Analysis. DBMS capabilities are used to analyze large amounts of data and generate reports. For example, you can use OLAP cubes to create summary tables. This helps you use DBMSs in ML model training and machine learning tasks.
Features of MySQL with synchronous and semi-synchronous replication
MySQL is owned by Oracle and is one of the most popular database solutions, although it is ahead of PostgreSQL in terms of forks. MySQL is distributed for free for non-commercial use. Large-scale projects such as Github, Wikipedia, Google, Booking.com, and Yelp run on MySQL databases, so talks about MySQL being obsolete are the viewpoint of proponents of other DBMS.
The solution is relational, i.e. related data is stored in tables, similar to PostgreSQL. The system automatically selects the most appropriate result for the operator's request based on the available data. Each table contains a set of columns:
table_name - table name,
column_name - column name,
column_type - column data type.
The table creation command must include all of the above. A data type is defined for each column of the table. Incorrect use of data types increases memory consumption and table query time. Errors of this type are more often noticed in large projects when the number of rows reaches several tens of thousands.
MySQL can store data as a key-value pair and allows you to work with different data types. In between:
text and binary,
date and time,
A JSON field is better suited for storing JSON than a text field because:
Provides automatic document verification. This means that if we try to put something invalid in there, an error will occur.
JSON is stored in binary format, which allows you to move from one JSON document to another.
Any claim that an RDBMS is faster or more reliable is speculation. Any comparison should look at the infrastructure schema, query syntax, customizations, and settings. Without this data, it is impossible to give an objective answer.
How replication works in MySQL
Replication in MySQL consists of three main steps:
The master server records data changes (events) in a binary log (binary log).
The slave device copies changes from the binary log to its own log, called the relay log.
The slave device replicates the changes in the relay log by applying them to its data.
There are two fundamentally different approaches to replication: line-by-line and line-by-line. In the first case, data change requests (INSERT, UPDATE, DELETE) are written to the master's log, and the slave replicates the commands exactly. In line-by-line replication, changes in table rows will be displayed in the log, and the same actual changes will be applied to the slave devices.
MySQL supports both replication methods, and the default changes vary by version. Modern versions, such as MySQL 8, use line-by-line replication by default.
The second principle that separates replication approaches is the number of master servers. Having a master server means that it accepts data changes itself and is the template from which changes are propagated to other slaves. For example, you can give remote clients the same fast ability to make changes to the database.
Areas of use and advantages of MySQL
The most suitable application for MySQL is web resources and web applications. When it comes to working with heavy loads, there is probably no single opinion what to choose - MySQL or PostgreSQL, but among the advantages of the solution the following points are often emphasized:
High Speed. If you focus on open source testing tools such as Sysbench, it shows this performance in some DBMS configurations.
Supports many types of tables. For example, MyISAM, InnoDB.
Robust and simple security. MySQL uses ACL-based security for all connections, queries, and other operations that users may attempt to perform.
Practicality and flexibility of the system. Years of user experience and a large community allow us to find answers to the most unusual questions.
Adequate job performance. MySQL is used as a reliable and scalable storage for OLTP (special key-value) workloads.
GUI support. WorkBench, SequelPro, DBVisualizer and Navicat DB.
Great set of tools. Different types of tables, indexes, transactions, stored procedures.
MySQL is a near-universal solution for e-commerce due to its transactional DBMS and high query processing speed. MySQL is the same industry standard as MS SQL or PostgreSQL.
Features of Redis
On the one hand, you can say that Redis became famous because it works 16-20 times more efficiently than PostgreSQL or MySQL, on the other hand, Redis is never used as the only database in a project, but if you know such cases, describe them in the comments.
The DBMS stores all cached data in the available RAM. This avoids constant access to the main database and optimizes its load. It should be noted that Redis is the only representative of a non-relational DBMS among the 10 most popular solutions.
One of the most popular Redis tasks is message queuing. This is a mechanism for transferring data between applications or components, providing asynchronous processing and optimizing the load on the system.
DBMS uses the List data structure to build message queues. It allows you to add items to the end of the list and retrieve items from the top of the list, making it ideal for creating queues. Redis uses the LPUSH, RPUSH, LPOP, and RPOP commands to work with message queues. The LPUSH and RPUSH commands add items to the beginning and end of the list, respectively, and the LPOP and RPOP commands remove the first and last items from the list.
Redis also supports BLPOP and BRPOP lock operations, which wait for new items to appear in the list and automatically remove them from the list when they appear.
How Redis is used
Storing user sessions. For example, snippets of HTML code from websites, items from online shopping carts, or routes in browsers.
Temporary storage of data. They can be evaluated by publications, completed forms or tables.
Message Broker. The DBMS converts messages using a single source protocol into a destination protocol message and acts as an intermediary between the two to route and invoke network services when necessary.
User data storage. For example, for analytics and other applications where speed and latency-free transmission are important.
Machine learning. The fast memory used by the system to store information allows processing large amounts of data, automating processes and conducting experiments quickly.
Advantages of the solution
Asynchronous replication. This means that if you copy information to multiple connected servers, it will distribute requests among themselves and increase read speed.
Scalability. Redis allows you to customize the cluster architecture, choose the cluster size, or scale out. This will keep your projects running quickly and reliably.
Flexibility. Unlike conventional storage, Redis allows you to work with unstructured data - it is stored by type: strings, lists, streams, etc..
Redis has earned a place in the top ten not only for its performance and functionality, but also for its ability to easily integrate with other systems. Redis capabilities are maximized when working with PostgreSQL or MySQL to solve problems with data queues or caching.
TimescaleDB is a PostgreSQL extension for working with time series. Time series can be stored in PostgreSQL without additional customization, but TimescaleDB offers better performance on the same hardware for this type of tasks. Most often, time series are used to collect data from any device and build additional predictions based on it. The DBMS inherits fault tolerance, toolkit, and the entire PostgreSQL ecosystem, such as support for native data types.
Many engineers start out using PostgreSQL to store time series data because of its reliability and ease of use, but when the amount of data grows significantly, they switch to some NoSQL system. Some abandon PostgreSQL altogether in favor of working with time series.
TimescaleDB supports scaling tables up to billions of rows while maintaining high and consistent data insertion (INSERT) rates. TimescaleDB allows you to store relational metadata and time series data in the same database, run queries using time series-optimized SQL, and continue to use your favorite tools and plug-ins.
DBMSs use trading platforms to display currency quotes. The solution is used in telemetry of transportation routes. All statistics of server access or CPU utilization is represented in time series. There are many tools available for this kind of work. For example, InfluxDB or ClickHouse, but even popular solutions for storing time series have their own peculiarities.
The problem with time series repositories is that they are few in number. This means that they are suitable for time series, period data. However, integrating them into your IT infrastructure is a separate and time-consuming process that can be simplified with TimescaleDB.
The non-permanent ranks have their own privileges
Time Stamp. Each time series record has a timestamp field where the value is stored.
Ubiquitous mode is for augmentation only. New data does not replace old data. Only obsolete data is deleted.
Applications are not considered separately. Data is only used in aggregate by window or time period.
When should I use TimescaleDB
TimescaleDB can be an alternative to, for example, InfluxDB in small projects. The solution works well with a small fixed set of tags.
Working with a database is closely related to inserting data (often when working with tens of millions of rows, depending on the memory allocated).
Queries are urgent and require more than one lookup of key values.
When SQL time series analysis capabilities are not enough. We need a time binding that records two dates of an object (time and event record date).
When data deletion becomes a problem.
TimescaleDB vs PostgreSQL
Compared to basic PostgreSQL, TimescaleDB demonstrates:
New functionality. Time series processing in SQL has become even easier thanks to the Internet of Things.
Accelerated Insertion. For example, standard PostgreSQL takes almost 40 hours to insert a billion rows of data, while TimescaleDB handles the entire volume in about 3 hours.
Fast cleanup. Delete data 2000 times faster for flexible storage policies.
Query optimization. Specialized query expansion system (up to 14,000 parameters) to speed up work with dependencies.
Although the TimescaleDB specification is more limited, the solution is gradually gaining prominence due to easier and cheaper integration into projects if the team is already familiar with PostgreSQL and uses it as a relational DBMS.
It seems that many DBMSs today have achieved similar functionality and design choices are more in line with existing habits: we work with what we already have. However, solutions such as Redis and TimescaleDB are more specialized and can specifically cover business problems. For example, they can handle time series and cache more efficiently so that users don't have to access the database every time. As a rule, they are not used in a project as the only DBMS, but together with MySQL or PostgreSQL they show excellent results and expand the service capabilities.