Performance Tuning PostgreSQL
The difficulty is that every database is not only different in its design, but also in its requirements. Some systems are used to logging mountains of data that is almost never queried. Others have essentially static data that is queried constantly, sometimes feverishly.
Most systems however have some, usually unequal, level of reads and writes to the database. Add this little complexity on top of your unique database schema structure, data, and hardware configuration, and hopefully you begin to see why tuning can be difficult.
The default configuration PostgreSQL ships with is a very solid configuration aimed at everyone's best guess as to how an “average” database on “average” hardware should be setup. This article aims to help PostgreSQL users of all levels better understand PostgreSQL performance tuning.
Understanding the process
The first step to learning how to tune your PostgreSQL database is to understand the life cycle of a query. Here are the steps of a query:
- Transmission of query string to database
- Parsing of query string
- Planning of query
- Retrieval of data from hardware
- Transmission of results to client
The first step is the sending of the query string (the actual SQL command you type in, or your application uses) to the database backend. There isn't much you can tune about this step; however, if you have very large queries that cannot be prepared in advance, it may help to put them into the database as a stored procedure and cut the data transfer down to a minimum.
Once the SQL query is inside the database server, it is parsed into tokens. This step can also be minimized by using stored procedures.
The planning of the query is where PostgreSQL really starts to do some work. This stage checks to see if the query is already prepared and if your version of PostgreSQL and client library support this feature. It also analyses your SQL to determine what the most efficient way of retrieving your data is. Should we use an index, and if so which one? Maybe a hash join on those two tables is appropriate?
These are some decisions the database makes at this point of the process. This step can be eliminated if the query is previously prepared. Now that PostgreSQL has a plan of what it believes to be the best way to retrieve the data, it is time to actually get it. While there are some tuning options that help here, your hardware configuration mostly affects this step.
And finally, the last step is to transmit the results to the client. While there aren't any real tuning options for this step, you should be aware that all the data that you are returning is pulled from the disk and sent over the wire to your client. Minimizing the number of rows and columns to only those that are necessary can often increase your performance.
There are several postmaster options that can be set that drastically impact performance. Below is a list of the most commonly used and how they affect performance:
max_connections: This option sets the maximum number of database backends to have at any one time. Use this feature to ensure that you do not launch so many backends that you begin swapping to disk and kill the performance of all the children. Depending on your application, it may be better to deny the connection entirely rather than degrade the performance of all the other children.
shared_buffers: Editing this option is the simplest way to improve the performance of your database server. The default is pretty low for most modern hardware. General wisdom says that this should be set to roughly 25% of available RAM on the system. Like most of the options I will outline here, you will simply need to try them at different levels (both up and down) and see how well it works on your particular system. Most people find that setting it larger than a third starts to degrade performance.
effective_cache_size: This value tells PostgreSQL's optimizer how much memory PostgreSQL has available for caching data and helps to determine whether it uses an index. A large value increases the likelihood of using an index. This should be set to the amount of memory allocated to shared_buffers plus the amount of OS cache available. Often this is more than 50% of the total system memory.
work_mem: This option is used to control the amount of memory usage in sort operations and hash tables. While you may need to increase the amount of memory if you do a ton of sorting in your application, care needs to be taken. This isn't a system-wide parameter, but a per operation one. So if a complex query has several sort operations in it, it will use multiple work_mem units of memory. Not to mention that multiple backends could be doing this at once. This query can frequently lead your database server to swap if the value is too large. This option was previously called sort_mem in older versions of PostgreSQL.
max_fsm_pages: This option helps to control the free space map. When something is deleted from a table it isn't removed from the disk immediately, it is simply marked as "free" in the free space map. The space can then be reused for any new INSERTs that you do on the table. If your setup has a high rate of DELETEs and INSERTs it may be necessary to increase this value to avoid table bloat. Note that max_fsm_pages is removed as of PostgreSQL 8.4 and later.
fsync: This option determines if all your WAL pages are fsync'ed to disk before a transaction is committed. Having this on is safer, but can reduce write performance. If fsync is not enabled, there is the chance of unrecoverable data corruption. Turn this off at your own risk.
commit_delay and commit_siblings: These options are used in concert to help improve performance by writing out multiple transactions that are committing at once. If there are commit_siblings number of backends active at the instant your transaction is committing, then the server waits commit_delay microseconds to try to commit multiple transactions at once.
random_page_cost: This option controls the way PostgreSQL views non‐sequential disk reads. A higher value makes it more likely that a sequential scan will be used over an index scan, indicating that your server has very fast disks.
Note: Many of these options consume shared memory, and it will probably be necessary to increase the amount of shared memory allowed on your system to get the most out of these options.
Obviously, the type and quality of the hardware you use for your database server drastically impacts the performance of your database. Here are a few tips to use when purchasing hardware for your database server (in order of importance):
RAM: The more RAM you have, the more disk cache you will have. This greatly impacts performance, considering memory I/O is thousands of times faster than disk I/O.
Disk types: Obviously, fast SAS disks are your best option; however, high end SATA drives are also very good. With SATA each disk is substantially cheaper and with that you can afford more spindles than with SAS on the same budget.
Disk configuration: The best configuration is RAID 1+0 with as many disks as possible and with your transaction log (pg_xlog) on a separate disk (or stripe) all by itself. RAID 5 is not a very good option for databases unless you have more than 6 disks in your volume. With newer versions of PostgreSQL, you can also use the tablespaces option to put different tables, databases, and indexes on different disks to help optimize performance. Such as putting your often used tables on a fast SAS disk and the less used ones slower IDE or SATA drives.
CPUs: The more CPUs the better, however if your database does not use many complex functions, your money is best spent on more RAM or a better disk subsystem.
In general, the more RAM and disk spindles you have in your system, the better it will perform. This is because with the extra RAM, you will access your disks less. And the extra spindles help spread the reads and writes over multiple disks to increase throughput and to reduce drive head congestion.
Another good idea is to separate your application code and your database server onto different hardware. Not only does this provide more hardware dedicated to the database server, but the operating system's disk cache will contain more PostgreSQL data and not other various application or system data this way.
For example, if you have one web server and one database server you can use a crossover cable on a separate Ethernet interface to handle just the web server to database network traffic to ensure you reduce any possible bottlenecks there. You can also obviously create an entirely different physical network for database traffic if you have multiple servers that access the same database server.
These issues typically affect both stateful and stateless applications in the same fashion. One good technique is to use server side prepared queries for any queries you execute often. This reduces the overall query time by caching the query plan for later use.
It should be noted, however, if you prepare a query in advance using placeholder values (such as 'column_name = ?') then the planner will not always be able to choose the best plan. For example, if your query has a placeholder for the boolean column 'active' and you have a partial index on false values, the planner won't use it because it cannot be sure the value passed in on execution will be true or false.
You can also obviously utilize stored procedures here to reduce the transmit, parse, and plan portions of the typical query life cycle. It is best to profile your application and find commonly used queries and data manipulations and put them into a stored procedure.
Explore the benefits and drawbacks of the rolling upgrade, blue-green deployment, canary release, and immutable infrastructure strategies...
Explore the challenges of database schema migration and various tools like Liquibase, Flyway, Schema Guard and more products for effective schema versioning...
Read my write out about the Dynamic Management Views in SQL Server...
Sometimes we're making strange things to gain more performance from the existing hardware, here is one theoretical way to run a database faster...