| <html> |
| |
| <center> |
| <h1> The "server-process-edition" Branch</h1> |
| </center> |
| |
| <p> |
| The "server-process-edition" branch contains two modifications to stock |
| SQLite that work together to provide concurrent read/write transactions |
| using pessimistic page-level-locking. The system runs in two modes: |
| |
| <ul> |
| <li><p> Single-process mode - where all clients must be within the |
| same address space, and |
| <li><p> Multi-process mode - where clients may be distributed between |
| multiple OS processes. |
| </ul> |
| |
| <p> The system is designed to be most efficient when used with |
| <a href="https://www.sqlite.org/pragma.html#pragma_synchronous"> |
| "PRAGMA synchronous=OFF"</a>, although it does not require this. |
| |
| <p> |
| Up to 16 simultaneous read/write transactions controlled by page-level-locking |
| are possible. Additionally, in single-process mode there may be any number of |
| read-only transactions started using the "BEGIN READONLY" command. Read-only |
| transactions do not block read-write transactions, and read-write transactions |
| do not block read-only transactions. Read-only transactions access a consistent |
| snapshot of the database - writes committed by other clients after the |
| transaction has started are never visible to read-only transactions. In |
| multi-process mode, the "BEGIN READONLY" command is equivalent to a stock |
| "BEGIN". |
| |
| <p> |
| The two features on this branch are: |
| <ol> |
| <li><p> An |
| <a href=#freelist>alternative layout for the database free-page list</a>. |
| This is intended to reduce contention between writers when allocating |
| new database pages, either from the free-list or by extending the |
| database file. |
| |
<li><p> The <a href=#servermode>"server-mode" extension</a>, which
       provides read/write page-level-locking concurrency and (in
       single-process mode) the read-only MVCC concurrency mentioned above.
| </ol> |
| |
| |
| <h2 id=freelist> 1.0 Alternative Free-List Format </h2> |
| |
| <p> |
| The alternative free-list format is very similar to the current format. It |
| differs in the following respects: |
| |
| <ul> |
| <li><p>The "total number of free pages" field in the db header is not |
| maintained. It is always set to zero. |
| <li><p> Instead of pointing to the first free-list trunk page, the free-list |
| pointer in the db header points to a page known as the "free-list node". |
| <li><p> The free-list node page contains N pointers to free-lists stored in the |
| legacy format (i.e. a linked list of trunk pages each containing |
| pointers to many free leaf pages). |
| </ul> |
| |
| <p> |
| This effectively means that a database has N free-lists instead of just one. To |
| allocate a free page, a writer only needs to lock one such free-list, and so up |
| to N transactions that allocate new pages may proceed concurrently. |
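<p>
The effect of this scheme can be sketched as follows. This is an illustrative
model only (the names and the try-lock representation are not from the
branch's source): a writer that needs a free page attempts to claim one of
the N sub-free-lists, starting from a preferred index, rather than
serializing on a single list.

```c
#include <stdatomic.h>

#define NLIST 16   /* number of sub-free-lists (children of the node page) */

/* Try to claim one of the NLIST free-lists for page allocation,
** starting the search at index 'preferred'. Returns the index of the
** list claimed, or -1 if all N lists are currently locked by other
** writers. atomic_flag stands in for the real page-level lock. */
static int pick_freelist(atomic_flag locks[NLIST], int preferred){
  for(int i=0; i<NLIST; i++){
    int j = (preferred + i) % NLIST;
    if( !atomic_flag_test_and_set(&locks[j]) ) return j;
  }
  return -1;
}
```

With 16 independent lists, two writers that prefer different indexes never
contend for the same free-list lock.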
| |
| <p> |
| Allocating pages from the end of the db file still locks out all other |
| read/write transactions (because it involves writing to page 1, which every |
| transaction needs to read). To minimize the frequency with which this occurs, |
| when a page must be allocated from the end of the database file, the file is |
| extended by 2048 pages. These are distributed equally between 16 free-lists |
| (children of the free-list node page). Additionally, the first trunk page in |
| each free list is never reused. Doing so would require writing to the |
| free-list node page - effectively an exclusive lock on the entire |
| page-allocation system. |
| |
| <p> |
| The current format used for the free-list can be modified or queried using a |
| new pragma: |
| |
| <pre> |
| PRAGMA [database.]freelist_format; |
| PRAGMA [database.]freelist_format = 1|2; |
| </pre> |
| |
| <p> |
At present, the free-list format may only be modified when the free-list is
completely empty. Because the implementation ensures that a free-list using
the alternative format is never completely emptied, this effectively
precludes changing the format from 2 (alternative) back to 1 (legacy).
| |
| <p> |
| For databases that use the "alternative" free-list format, the read and write |
| versions in the database header (byte offsets 18 and 19) are set to 3 for |
| rollback mode or 4 for wal mode (instead of 1 and 2 respectively). |
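<p>
A sketch of how a tool might detect the format from a raw database header
(the function name is illustrative; only the byte offsets and values come
from the description above):

```c
#include <stdint.h>

/* Given the first 100 bytes of a database file, report whether the
** alternative free-list format is in use. Offsets 18 and 19 hold the
** write and read versions: 1/2 are the stock rollback and wal codes,
** 3/4 the alternative-free-list equivalents. */
static int uses_alternative_freelist(const uint8_t *hdr){
  return hdr[18]==3 || hdr[18]==4;
}
```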
| |
| <h2 id=servermode> 2.0 Page level locking - "Server Mode" </h2> |
| |
| <p> |
A database client automatically enters "server mode" if there exists a
<i>directory</i> named "&lt;database&gt;-journal" in the file system alongside
the database file "&lt;database&gt;". There is currently no provision for
creating this directory, although it could be safely done for a database in
rollback mode using something like:
| |
| <pre> |
| PRAGMA journal_mode = off; |
| BEGIN EXCLUSIVE; |
    &lt;create directory&gt;
| END; |
| </pre> |
| |
<p> As well as signalling to new clients that they should enter server-mode,
creating a directory named "&lt;database&gt;-journal" has the helpful
side-effect of preventing legacy clients from accessing the database file at
all.
| |
<p> If the VFS is one that takes an exclusive lock on the db file (to
guarantee that no other process accesses the db file), then the system
automatically enters single-process mode. Otherwise, it enters multi-process
mode.
| |
| <p> In both single and multi-process modes, page-level-locking is managed |
| by allocating a fixed-size array of "locking slots". Each locking slot is |
| 32-bits in size. By default, the array contains 262144 (2^18) slots. Pages are |
| assigned to locking slots using the formula (pgno % 262144) - so pages 1, |
| 262145, 524289 etc. share a single locking slot. |
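<p>
The page-to-slot mapping is simple enough to state in code. A minimal sketch
(NSLOT and the function name are illustrative):

```c
#include <stdint.h>

#define NSLOT (1u << 18)   /* 262144 locking slots by default */

/* Map a page number to its locking slot. Pages whose numbers differ
** by a multiple of NSLOT share a slot, so a lock on one such page
** also (falsely) covers the others. */
static uint32_t slot_for_page(uint32_t pgno){
  return pgno % NSLOT;
}
```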
| |
| <p> In single-process mode, the array of locking slots is allocated on |
| the process heap and access is protected by a mutex. In multi-process mode, it |
| is created by memory-mapping a file on disk (similar to the *-shm file in |
| SQLite wal mode) and access is performed using |
| <a href="https://en.wikipedia.org/wiki/Compare-and-swap">atomic CAS |
| primitives</a> exclusively. |
| |
| <p> Each time a read/write transaction is opened, the client assumes a client |
| id between 0 and 15 for the duration of the transaction. Client ids are unique |
| at any point in time - concurrently executing transactions must use different |
| client ids. So there may exist a maximum of 16 concurrent read/write |
| transactions at any one time. |
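<p>
One way to picture the client-id rule: ids behave like slots in a 16-bit
bitmask, claimed for the duration of a transaction. The sketch below is
illustrative, not the branch's actual code; a real implementation would claim
the bit under the same mutex (or CAS discipline) that guards the locking-slot
array.

```c
#include <stdint.h>

/* Hand out a free client id (0..15) by claiming a bit in a 16-bit
** mask; return -1 if all 16 ids are in use (the would-be 17th
** concurrent read/write transaction gets SQLITE_BUSY). */
static int claim_client_id(uint16_t *mask){
  for(int id=0; id<16; id++){
    if( (*mask & (1u<<id))==0 ){
      *mask |= (1u<<id);
      return id;
    }
  }
  return -1;
}

/* Return a client id to the pool when the transaction concludes. */
static void release_client_id(uint16_t *mask, int id){
  *mask &= (uint16_t)~(1u<<id);
}
```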
| |
| <p> Read/write transactions in server-mode are similar to regular SQLite |
| transactions in rollback mode. The most significant differences are that: |
| |
| <ul> |
  <li> <p>Instead of using journal file &lt;database&gt;-journal, server-mode
          clients use &lt;database&gt;-journal/&lt;client-id&gt;-journal. If
          there are multiple concurrent transactions, each uses a separate
          journal file.
| |
| <li> <p>No database-wide lock is taken. Instead, individual read and write |
| locks are taken on the pages accessed by the transaction. |
| </ul> |
| |
<p> Each locking slot is 32-bits in size. A locking slot may simultaneously
support a single write-lock, up to 16 read-locks from read/write clients, and
(in single-process mode) up to 1024 read-locks from "BEGIN READONLY" clients.
Locking slot bits are used as follows:
| |
| <ul> |
  <li> <p> The least-significant 16 bits are used for read-locks taken by
           read/write clients. To take a read-lock, bit &lt;client-id&gt; of the
           locking slot is set.
| |
  <li> <p> The next 5 bits are used for the write-lock. If no write-lock
           is held on the slot, then this 5-bit integer is set to 0. Otherwise,
           it is set to (<i>C</i> + 1), where <i>C</i> is the &lt;client-id&gt; of
           the client holding the write-lock.
| |
| <li> <p> The next 10 bits contain the total number of read-locks held by |
| "BEGIN READONLY" clients on the locking slot. See the section below |
| for a description of how these are used. |
| </ul> |
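<p>
Putting the fields together, a multi-process client might take a read-lock
with a compare-and-swap loop along these lines. The field positions are
assumptions derived from the description above, not lifted from the branch's
source:

```c
#include <stdint.h>
#include <stdatomic.h>

#define WRITER_SHIFT 16
#define WRITER_MASK  (0x1Fu << WRITER_SHIFT)   /* 5-bit writer field */

/* Attempt to set read-lock bit 'clientId' (0..15) on a locking slot.
** Returns 0 (the SQLITE_BUSY case) if another client holds the
** write-lock. The writer field stores (client-id + 1) when locked,
** so a client may read-lock a page it has itself write-locked. */
static int slot_read_lock(_Atomic uint32_t *slot, int clientId){
  uint32_t old = atomic_load(slot);
  do{
    uint32_t writer = (old & WRITER_MASK) >> WRITER_SHIFT;
    if( writer!=0 && writer!=(uint32_t)clientId+1 ) return 0;
  }while( !atomic_compare_exchange_weak(slot, &old, old | (1u<<clientId)) );
  return 1;
}
```

The weak CAS retries if another client modified the slot between the load and
the exchange, which is exactly the lock-free discipline required when the
slot array lives in shared memory.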
| |
| <p> Currently, if a client requests a lock that cannot be granted due to |
| a conflicting lock, SQLITE_BUSY is returned to the caller and either the |
| entire transaction or statement transaction must be rolled back. See |
| <a href=#problems>Problems and Issues</a> below for more details. |
| |
| <h3> 2.1 Single-Process Mode </h3> |
| |
| <p> Single process mode is simpler than multi-process mode because it does |
| not have to deal with runtime client failure - it is assumed that if one |
| client fails mid-transaction the entire process crashes. As a result the |
| only time hot-journal rollback is required in single-process mode is as |
| part of startup. The first client to connect to a database in single-process |
| mode attempts to open and rollback all 16 potential hot journal files. |
| |
| <p> But, in order to support non-blocking "BEGIN READONLY" transactions, it is |
| also in some ways more complicated than multi-process mode. "BEGIN READONLY" |
| support works as follows: |
| |
| <ul> |
| |
| <li> <p>In single-process mode, writers never spill the cache mid-transaction. |
| Data is only written to the database as part of committing a transaction. |
| |
| <li> <p>As well as writing the contents of overwritten pages out to the journal |
| file, a writer in single-process mode also accumulates a list of buffers |
| containing the original data for each page overwritten by the current |
| transaction in main-memory. |
| |
| <li> <p>When a transaction is ready to be committed, a writer obtains a |
| transaction-id. Transaction-ids are assigned to writers using a |
| monotonically increasing function. The writer then adds all of its "old |
| data" buffers to a hash table accessible to all database clients. |
| Associated with each hash table entry is the newly assigned transaction-id. |
| It then waits (spin-locks) for all "BEGIN READONLY" read-locks to clear on |
| all pages that will be written out by the transaction. Following this, it |
| commits the transaction as normal (writes out the dirty pages and zeroes |
| the journal file header). |
| |
<li> <p>Clients executing "BEGIN READONLY" transactions are not assigned
  a &lt;client-id&gt;. Instead, they are assigned a transaction-id that is
| either (a) that of the oldest transaction-id belonging to a writer that has |
| not yet finished committing, or (b) if there are currently no writers |
| committing then the value that will be assigned to the next committer. |
| |
| <li> <p>When a "BEGIN READONLY" transaction reads a page, it first checks |
| the aforementioned hash table for a suitable entry. A suitable entry |
| is one with the right page-number and a transaction-id greater than or |
| equal to that of the "BEGIN READONLY" transaction (i.e. one that had not |
| finished committing when the BEGIN READONLY transaction started). If such |
| an entry can be found, the client uses the associated data instead of |
| reading from the db file. Or, if no such entry is found, the client: |
| <ol> |
| <li> Increments the number of BEGIN READONLY read-locks on the page. |
| <li> Reads the contents of the page from the database file. |
| <li> Decrements the number of BEGIN READONLY read-locks on the page. |
| </ol> |
| <p> The mutex used to protect access to the array of locking slots and |
| the shared hash table is relinquished for step 2 above. |
| |
<li> <p>After each transaction is committed in single-process mode, the
| client searches the hash table for entries that can be discarded. An |
| entry can be discarded if it has a transaction-id older than any still |
| in use (either by BEGIN READONLY transactions or committers). |
| </ul> |
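<p>
The key comparison in the reader's hash-table lookup can be sketched as
follows (the structure and names are illustrative):

```c
#include <stdint.h>

/* A reader with snapshot transaction-id rdTid wants page pgno. An
** "old data" hash-table entry written by a committer is suitable if
** it is for the right page and its transaction-id is greater than or
** equal to rdTid - i.e. the committer had not finished when the
** reader's snapshot was taken, so the entry holds the page image the
** reader must see. */
typedef struct OldPage {
  uint32_t pgno;   /* page number the buffer belongs to */
  uint64_t tid;    /* transaction-id of the committing writer */
} OldPage;

static int entry_is_suitable(const OldPage *p, uint32_t pgno, uint64_t rdTid){
  return p->pgno==pgno && p->tid>=rdTid;
}
```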
| |
| <h3> 2.2 Multi-Process Mode </h3> |
| |
| <p> Multi-process mode differs from single-process mode in two important ways: |
| |
| <ul> |
| <li> <p>Individual clients may fail mid-transaction and the system must recover |
| from this. |
| |
| <li> <p>Partly as a consequence of the above, there are no convenient |
| primitives like mutexes or malloc() with which to build complicated data |
| structures like the hash-table used in single-process mode. As a result, |
| there is no support for "BEGIN READONLY" transactions in multi-process |
| mode. |
| </ul> |
| |
| <p> Unlike single-process mode clients, which may be assigned a different |
| client-id for each transaction, clients in multi-process mode are assigned a |
| client-id when they connect to the database and do not relinquish it until |
| they disconnect. As such, a database in multi-process server-mode supports |
| at most 16 concurrent client connections. |
| |
| <p> As well as the array of locking slots, the shared-memory mapping used |
| by clients in multi-process mode contains 16 "client slots". When a client |
| connects, it takes a posix WRITE lock on the client slot that corresponds |
| to its client id. This lock is not released until the client disconnects. |
| Additionally, whenever a client starts a transaction, it sets the value |
| in its client locking slot to 1, and clears it again after the transaction |
| is concluded. |
| |
| <p> This assists with handling client failure mid-transaction in two ways: |
| |
| <ul> |
| <li><p> If client A cannot obtain a lock due to a conflicting lock held by |
| client B, it can check whether or not client B has failed by attempting a |
| WRITE lock on its client locking slot. If successful, then client B must |
| have failed and client A may: |
| <ul> |
| <li> Roll back client B's journal, and |
| <li> By iterating through the entire locking slot array, release all |
| locks held by client B when it failed. |
| </ul> |
| |
| <li><p> When a client first connects and locks its client locking slot, it |
| can check whether or not the previous user of the client locking slot failed |
| mid-transaction (since if it did, the locking slot value will still be |
| non-zero). If it did, the new owner of the client locking slot can release |
| any locks and roll back any hot-journal before proceeding. |
| </ul> |
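<p>
The liveness probe can be sketched with POSIX advisory locks. This is
illustrative only: the byte offsets and the mapping of client slots to lock
bytes are assumptions, and a real implementation must take care not to
disturb locks held by its own process.

```c
#include <fcntl.h>
#include <unistd.h>

/* Probe whether the client owning 'clientId' is still connected by
** attempting a non-blocking write-lock on that client's lock byte in
** the shared-memory file. Because a live client holds this lock from
** connect to disconnect, obtaining it means the owner is gone; the
** probe then releases the lock and returns 1. Returns 0 if the lock
** is still held by a live owner. */
static int client_slot_is_dead(int fd, int clientId){
  struct flock fl = {
    .l_type = F_WRLCK,
    .l_whence = SEEK_SET,
    .l_start = clientId,    /* assume one lock byte per client slot */
    .l_len = 1
  };
  if( fcntl(fd, F_SETLK, &fl)!=0 ) return 0;
  fl.l_type = F_UNLCK;
  fcntl(fd, F_SETLK, &fl);
  return 1;
}
```

Note that because POSIX record locks do not conflict within a single process,
this probe is only meaningful when the conflicting client runs in a different
process, which is exactly the multi-process case it is needed for.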
| |
| <h3> 2.3 Required VFS Support </h3> |
| |
| <p> The server-mode extension requires that the VFS support various special |
| file-control commands. Currently support is limited to the "unix" VFS. |
| |
| <dl> |
| <dt> SQLITE_FCNTL_SERVER_MODE |
| <dd><p> This is used by SQLite to query the VFS as to whether the |
| connection should use single-process server-mode, multi-process server-mode, |
| or continue in legacy mode. |
| |
<p>SQLite invokes this file-control as part of the procedure for detecting a
hot journal (after it has established that there is a file-system entry named
&lt;database&gt;-journal and that no other process holds a RESERVED lock).
If the &lt;database&gt;-journal directory is present in the file-system and
the current VFS takes an exclusive lock on the database file (i.e. is
"unix-excl"), then this file-control indicates that the connection should use
single-process server-mode. Or, if the directory exists but the VFS does not
take an exclusive lock on the database file, that the connection should use
multi-process server-mode. Or, if there is no directory of the required name,
that the connection should use legacy mode.
| |
| <dt> SQLITE_FCNTL_FILEID |
| <dd><p> Return a 128-bit value that uniquely identifies an open file on disk |
| from the VFS. This is used to ensure that all connections to the same |
| database from within a process use the same shared state, even if they |
| connect to the db using different file-system paths. |
| |
| <dt> SQLITE_FCNTL_SHMOPEN |
| <dd> |
| |
| <dt> SQLITE_FCNTL_SHMOPEN2 |
| <dd> |
| |
| <dt> SQLITE_FCNTL_SHMLOCK |
| <dd> |
| |
| <dt> SQLITE_FCNTL_SHMCLOSE |
| <dd> |
| </dl> |
| |
| |
| <h2 id=problems> 3.0 Problems and Issues </h2> |
| |
| <ul> |
| |
| <li> <p>Writer starvation might be the biggest issue. How can it be |
| prevented? |
| |
| <li> <p>Blocking locks of some sort would likely improve things. The issue |
| here is deadlock detection. |
| |
| <li> <p>The limit of 16 concurrent clients in multi-process mode could be |
| raised to 27 (since the locking-slot bits used for BEGIN READONLY |
| locks in single-process mode can be reassigned to support more |
| read/write client read-locks). |
| |
| </ul> |
| |
| <h2> 4.0 Performance Test </h2> |
| |
| <p> |
| The test uses a single table with the following schema: |
| |
| <pre> |
| CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB(16), c BLOB(16), d BLOB(400)); |
| CREATE INDEX i1 ON t1(b); |
| CREATE INDEX i2 ON t1(c); |
| </pre> |
| |
| <p> |
The database initially contains 5,000,000 rows. Values for column "a" are
between 1 and 5,000,000, inclusive. Columns "b", "c" and "d" are populated
with randomly generated blob values of 16, 16 and 400 bytes respectively.
| |
| <p> |
| Read/write transactions used by the test take the following form. Each such |
| transaction modifies approximately 25 pages (5 in the main table and 10 in each |
| index), not accounting for tree rebalancing operations. |
| |
| <pre> |
| BEGIN; |
| REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400)); |
| REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400)); |
| REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400)); |
| REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400)); |
| REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400)); |
| COMMIT; |
| </pre> |
| |
| <p> |
| Read-only transactions are as follows: |
| |
| <pre> |
| BEGIN READONLY; |
| SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10; |
| SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10; |
| SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10; |
| SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10; |
| SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10; |
| END; |
| </pre> |
| |
| <p> |
| The performance test features one or more clients executing read/write |
| transactions as fast as possible, and zero or more clients executing read-only |
| transactions, also as fast as possible. All tests use the "unix-excl" VFS and |
| all clients execute in a separate thread within the same process. The database |
| is located on a tmpfs file-system. |
| |
| <p> |
| In the table below "rw:" refers to the number of read-write clients, and "ro:" |
| the number of read-only clients used by the test. The TPS values in brackets |
| are the number of read-only transactions per second. All other values are |
| read-write transactions per second. |
| |
| <p> |
| The collision rate (percentage of attempted transactions that failed due to a |
| page-level locking conflict) in all tests was between 1 and 2%. Failed |
| transactions are not included in the TPS counts below. |
| |
| <p> |
| <table border=1 width=90% align=center> |
| <!-- |
| 320 jm=persist, rw:1 139675, 135797 |
| 320 jm=wal, rw:1 120995, 118889 |
| begin-concurrent, rw:1 119438, 117580 |
| begin-concurrent, rw:2 166923, 166904 |
| begin-concurrent, rw:3 180432, 172825 |
| begin-concurrent, rw:2,ro:1 150427(347319),152087(351601) |
| server-mode, rw:1 126592, 126742 |
| server-mode, rw:2 228317, 227155 |
| server-mode, rw:3 309712, 306218 |
| server-mode, rw:2,ro:1 213303(576032),210994(556005) |
| --> |
| |
| <tr><th> Configuration <th>TPS per client <th> TPS total |
| <tr><td> 3.20.0, journal_mode=persist, rw:1, ro:0 <td>6886 <td> 6886 |
| <tr><td> 3.20.0, journal_mode=wal, rw:1, ro:0 <td>5997 <td> 5997 |
| <tr><td> begin-concurrent, rw:1, ro:0<td>5925<td> 5925 |
| <tr><td> begin-concurrent, rw:2, ro:0<td>4172 <td> 8345 |
| <tr><td> begin-concurrent, rw:3, ro:0<td>2943 <td> 8831 |
| <tr><td> begin-concurrent, rw:2, ro:1<td>3781 <td> 7562 (17473) |
| <tr><td> server-mode, rw:1, ro:0 <td>6333 <td> 6333 |
| <tr><td> server-mode, rw:2, ro:0<td> 5693 <td> 11386 |
| <tr><td> server-mode, rw:3, ro:0<td>5132 <td> 15398 |
| <tr><td> server-mode, rw:2, ro:1<td>5303 <td> 10607 (28300) |
| </table> |
| |