
Does Google's Persistent Disk have proper reader/writer locks for concurrent access (many virtual machines accessing data from a single Persistent Disk) which is required for SQLite?

According to the SQLite FAQ:

SQLite uses reader/writer locks to control access to the database. [...] But use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. You should avoid putting SQLite database files on NFS if multiple processes might try to access the file at the same time.

I would really like to know whether Google's Persistent Disk locks properly for writes. I researched AWS's EFS, and it does not have a proper locking system to support SQLite.

Misha Brukman
Jon
  • Using sqlite in a 1:X scenario? Have you thought about migrating your code to MySQL? It's doable; I did it for a small project, and I'm far from being a good programmer. – yagmoth555 Dec 31 '16 at 15:50
  • Thanks for the answer. MySQL or Cloud SQL is out of our options entirely due to the need to pass around a single SQLite file. That architecture cannot be changed. – Jon Dec 31 '16 at 18:28
  • Did you find out if the persistent disks have the correct file APIs for correct SQLite operation? I think the persistent disks appear as normal disks to the compute engine VM instance, not as an NFS disk? – zino Feb 14 '19 at 13:30
  • See also [Dqlite](https://dqlite.io); I've updated my answer to give more context. – Misha Brukman Aug 30 '19 at 17:15

2 Answers


Update (30 Aug 2019)

Dqlite by Canonical provides a distributed, high-availability SQLite which may suit this use case. It's open-source under Apache 2.0, and written in C, so it could be a drop-in replacement for SQLite. See also discussion on HN for more context.


Earlier answer

It sounds like from your comments that MySQL and Google Cloud SQL are not an option due to your architecture requiring the use of a single SQLite file.

Also, per SQLite docs, NFS is not an option due to the locking issues.

Here are some other options to consider.

Alternative distributed filesystem

In addition to NFS, there are a number of other distributed filesystems you may want to evaluate, such as Ceph, GlusterFS, OrangeFS, ZFS, etc. In addition to your own research, consider reaching out to SQLite users or developers for guidance and past experiences.

Use NFS, but enforce single writer-at-a-time

The NFS issue appears to be about locking, which is only needed for writes: as long as you can guarantee that only one process at a time has the database open for writing, several other processes can open it for reads, so this should be OK (please confirm/verify that this is the case).

Thus, as long as there's an external method of ensuring a single writer, you may be able to use NFS. Consider a distributed lock service such as Apache ZooKeeper, HashiCorp Consul, or CoreOS etcd for the locking, and you can keep your SQLite database on NFS.

This, of course, relies on each process with direct access to the SQLite database properly closing it once it no longer needs to write, so correctness is hard to enforce: it depends on all of the software being correct and cooperating.
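A minimal sketch of this pattern in Python, with a local no-op context manager standing in for the distributed lock (the names `distributed_write_lock`, `write_record`, and `read_count` are illustrative, not part of any of these APIs, and the temp-directory path stands in for the NFS mount):

```python
import contextlib
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "shared.db")  # stands in for the NFS-mounted file

@contextlib.contextmanager
def distributed_write_lock(name):
    # Placeholder: a real deployment would acquire a ZooKeeper, Consul,
    # or etcd lock here; locally this is a no-op.
    yield

def write_record(url, status):
    # Open the database for writing only while holding the external lock.
    with distributed_write_lock("sqlite-writer"):
        conn = sqlite3.connect(db_path)
        with conn:  # commits on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, status INTEGER)")
            conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, status))
        conn.close()  # close promptly so the next writer can proceed

def read_count():
    # Readers take no external lock and open the file read-only.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    n = conn.execute("SELECT count(*) FROM pages").fetchone()[0]
    conn.close()
    return n

write_record("https://example.com/", 200)
print(read_count())  # -> 1
```

The essential discipline is that every write path goes through the lock acquisition and closes the connection promptly; nothing in SQLite itself enforces this on NFS.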

Lightweight RPC server

You mentioned that your architecture (which cannot be changed at this time) relies on SQLite, but if it's possible to have the clients call an RPC service instead of opening the file directly, that server can be the single point of access to the SQLite database, avoiding the locking issue from multiple concurrent users. However, this means you would have to change all the client code to call the RPC service instead of opening the SQLite database directly, which is a non-trivial amount of work.
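To illustrate how small such a front-end can be, here is a sketch using Python's standard-library XML-RPC (the function names, port, and schema are made up for illustration; a production service would add authentication, error handling, and batching, and would open the real shared file rather than an in-memory database):

```python
import sqlite3
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# The server process is the *only* opener of the SQLite database, so all
# locking is confined to one process. ":memory:" is used for the sketch.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, status INTEGER)")
db_lock = threading.Lock()  # serialize access inside the server

def record_page(url, status):
    with db_lock:
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, status))
        db.commit()
    return True

def count_pages():
    with db_lock:
        return db.execute("SELECT count(*) FROM pages").fetchone()[0]

server = SimpleXMLRPCServer(("127.0.0.1", 8973), logRequests=False, allow_none=True)
server.register_function(record_page)
server.register_function(count_pages)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Clients call the RPC service instead of opening the file themselves.
client = ServerProxy("http://127.0.0.1:8973")
client.record_page("https://example.com/", 200)
print(client.count_pages())  # -> 1
```

The trade-off is exactly the one described above: every client must be changed to speak RPC, but in exchange no client ever touches the file's locks.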

Conclusion

None of these options is trivial; all of them will require work. The reason is that:

In contrast to many other database management systems, SQLite is not a client–server database engine. Rather, it is embedded into the end program.

As such, it is not the right solution for multiple accessors, and hence a number of workarounds are required.

Longer term, if you're in a situation where you would have to make significant changes to continue to be able to maintain this system, you may want to consider migrating to MySQL or Google Cloud SQL instead of investing in workarounds to keep using SQLite.

Misha Brukman
  • Thanks for the info on Apache ZooKeeper, HashiCorp Consul, etc. I'll do some research on those. The main issue for our architecture is scaling. Think of thousands of web crawlers, each inserting thousands of records a second into a central database. We tried MySQL for that purpose and it utterly failed. So now each crawler uses its own SQLite db, which is much more scalable. But we would like to have them centrally accessible if possible, without having to move them in and out of S3. Their size can be quite large. Anyway, thanks for the detailed answers! – Jon Jan 01 '17 at 15:46
  • @Jon: thanks for more information about your use case. Since you're talking about web crawlers, you're storing information on a per-web-page basis. You should consider long-term re-architecting to use a distributed database, because SQLite will not scale far for you in this use case. Also consider using either [Datastore](https://cloud.google.com/datastore/) or [Bigtable](https://cloud.google.com/bigtable/) — disclaimer: I'm the product manager for Bigtable — it's the database Google built for its own web search indexing; see the [paper](https://research.google.com/archive/bigtable.html). – Misha Brukman Jan 01 '17 at 15:56
  • I'll definitely look into Bigtable. It looks promising for my purposes. – Jon Jan 01 '17 at 16:20

Re: using SQLite for multiple access: the FAQ you linked to says:

(5) Can multiple applications or multiple instances of the same application access a single database file at the same time?

Multiple processes can have the same database open at the same time. Multiple processes can be doing a SELECT at the same time. But only one process can be making changes to the database at any moment in time, however.

which means that you should not use SQLite as a general-purpose database: it's intended to be used as an embedded database with a single writer, though it can also support several readers.

Re: Persistent Disks: see my answer to a related question on Stack Overflow; the short story is that a persistent disk on Google Compute Engine can be mounted either:

  • read-write to a single instance
  • read-only to several instances

Thus, you cannot share a persistent disk in read-write mode across several VMs as you are suggesting. If you want a shared database, as yagmoth555 pointed out in the comments, you should use a SQL database such as MySQL.

Conveniently, Google Cloud SQL provides a managed MySQL instance which you can use from multiple VMs, as it provides proper locking as well as backups, failover, etc.

Misha Brukman
  • Thanks for the answer. MySQL or Cloud SQL is out of our options entirely due to the need to pass around a single SQLite file. That architecture cannot be changed. I looked at your Stack Overflow answer and you referenced `However, if you attach a persistent disk to multiple instances, all instances must attach the persistent disk in read-only mode.` but I cannot find the documentation that states that here: https://cloud.google.com/compute/docs/disks/#flexibility Can you point me to that statement? – Jon Dec 31 '16 at 18:26
  • @Jon: looks like the docs have been re-arranged; that information now lives on [this page](https://cloud.google.com/compute/docs/disks/add-persistent-disk#use_multi_instances): _"If you attach a persistent disk to multiple instances, all of those instances must attach the persistent disk in read-only mode. It is not possible to attach the persistent disk to multiple instances in read-write mode."_ – Misha Brukman Dec 31 '16 at 18:45
  • Thanks, that's exactly what I needed to know. I don't know why I could not find that info. – Jon Jan 01 '17 at 15:39