CrateDB
CrateDB is a distributed SQL database management system that integrates a fully searchable document-oriented data store. It is open-source, written in Java, based on a shared-nothing architecture, and designed for high scalability. CrateDB includes components from Presto, Lucene, Elasticsearch and Netty.
Developer(s) | Crate.io, Inc. |
---|---|
Written in | Java |
Operating system | Cross-platform |
Type | Data Store |
License | Apache License 2.0 |
Website | crate |
History
The CrateDB project was started by Jodok Batlogg, an open-source contributor to the Open Source Initiative Vorarlberg,[1] while he was at Lovely Systems in Dornbirn. The software is an open-source, clustered database used for fast text search and analytics.[2]
The company, now called Crate.io, raised its first round of financing in April 2014.[3] In June that year, Crate.io won the judge's choice award at the GigaOm Structure Launchpad competition.[4] In October, Crate.io won the TechCrunch Disrupt Europe competition in London.[5]
Crate.io closed a $4M funding round in March 2016.[6] In December, CrateDB 1.0 was released, by which time the software had been downloaded more than one million times.[7][8]
CrateDB 2.0, the first Enterprise Edition of CrateDB, was released in May 2017[9][10][11] after a $2.5M round from Dawn Capital, Draper Esprit, Speedinvest, and Sunstone Capital.[12]
CrateDB 4.0 was released in June 2019.[13]
Since September 2020, Crate.io has been led by Eva Schönleitner as CEO.[14]
Overview
Architecture
CrateDB operates in a shared-nothing architecture as a cluster of identically configured servers (nodes). The nodes coordinate to automatically distribute the execution of both write and query operations across the cluster.
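As a rough illustration of how a shared-nothing cluster can spread work without a central coordinator, the following Python sketch routes row keys to shards with a stable hash and places shards on nodes round-robin. Both rules, and the names shard_for, allocate_shards and route_writes, are assumptions made for this example; they are not CrateDB's actual routing or allocation logic.

```python
import hashlib
from collections import defaultdict


def shard_for(key, num_shards):
    """Pick a shard for a row key using a stable hash (hypothetical rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def allocate_shards(num_shards, nodes):
    """Place shards on nodes round-robin (hypothetical allocation)."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


def route_writes(keys, num_shards, nodes):
    """Group row keys by the node that holds their shard."""
    allocation = allocate_shards(num_shards, nodes)
    per_node = defaultdict(list)
    for key in keys:
        per_node[allocation[shard_for(key, num_shards)]].append(key)
    return dict(per_node)


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    keys = [f"sensor-{i}" for i in range(12)]
    for node, routed in sorted(route_writes(keys, 6, nodes).items()):
        print(node, routed)
```

Because every node can compute the same hash, any node can work out where a row belongs without consulting shared state, which is the property the sketch is meant to show.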
Querying
CrateDB's SQL syntax includes JOINs, aggregations, indexes, sub-queries, user-defined functions, and views. It also supports full-text search, geospatial queries, and nested JSON object columns.
For query distribution, CrateDB implements memory-resident columnar field caches on each shard. The caches tell the query engine whether there are rows on that shard that meet the query criteria, and where the rows are located. This is performed automatically.
Schemas
CrateDB supports “strict”, “dynamic”, or “ignored” schemas:
- Strict schema: if an INSERT statement includes a column that wasn’t defined in the table, CrateDB enforces the original schema by rejecting the INSERT and throwing an error.
- Dynamic schema: CrateDB automatically updates the schema by indexing the new column.
- Ignored schema: CrateDB doesn’t index the column, but it stores the plain JSON value.
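The three behaviours above can be sketched with a small in-memory model. This is a toy written for illustration, not CrateDB code: the class ToyTable, the exception ColumnPolicyError, and the idea of representing the indexed schema as a Python set are all assumptions of the example.

```python
class ColumnPolicyError(Exception):
    """Raised when a strict table receives an undefined column."""


class ToyTable:
    """In-memory stand-in for a table with a schema policy (hypothetical)."""

    def __init__(self, columns, policy="strict"):
        if policy not in ("strict", "dynamic", "ignored"):
            raise ValueError(f"unknown policy: {policy}")
        self.schema = set(columns)  # columns that are defined and indexed
        self.policy = policy
        self.rows = []              # stored rows, kept as plain dicts

    def insert(self, row):
        unknown = set(row) - self.schema
        if unknown and self.policy == "strict":
            # Strict: reject the whole INSERT.
            raise ColumnPolicyError(f"undefined columns: {sorted(unknown)}")
        if unknown and self.policy == "dynamic":
            # Dynamic: extend the schema so the new columns are indexed.
            self.schema |= unknown
        # Ignored: the value is stored with the row, the schema is unchanged.
        self.rows.append(dict(row))


if __name__ == "__main__":
    for policy in ("strict", "dynamic", "ignored"):
        table = ToyTable(["id", "name"], policy=policy)
        try:
            table.insert({"id": 1, "name": "a", "extra": 42})
            print(policy, sorted(table.schema), table.rows)
        except ColumnPolicyError as err:
            print(policy, "rejected:", err)
```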
Consistency
CrateDB implements an eventually consistent, non-blocking data insertion model. It includes record versioning, optimistic concurrency control, and a table-level refresh frequency setting, which forces CrateDB data to become consistent every n milliseconds.
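Record versioning combined with optimistic concurrency control can be illustrated generically: a writer states which version it read, and the write is rejected if the row has changed since. The sketch below shows that general technique only; VersionedStore, VersionConflict and the expected_version parameter are hypothetical names, and the version numbering scheme is an assumption.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of the row."""


class VersionedStore:
    """Key-value store that bumps a version on every write (hypothetical)."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def get(self, key):
        """Return the (version, value) pair for a key."""
        return self._rows[key]

    def put(self, key, value, expected_version=None):
        """Write a value; fail if expected_version no longer matches."""
        current = self._rows.get(key)
        current_version = current[0] if current else 0
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._rows[key] = (new_version, value)
        return new_version


if __name__ == "__main__":
    store = VersionedStore()
    store.put("row-1", {"count": 1})
    version, value = store.get("row-1")
    store.put("row-1", {"count": value["count"] + 1}, expected_version=version)
    try:
        # A second writer still holding the old version loses the race.
        store.put("row-1", {"count": 99}, expected_version=version)
    except VersionConflict as err:
        print("conflict:", err)
```

No lock is held between the read and the write, so writers never block each other; the cost is that a losing writer has to re-read and retry.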
CrateDB supports read-after-write consistency: the queries retrieving a specific row by its primary key always receive the most recent row. All the other queries (search operations) return eventually-consistent data.
Search operations are performed on shared IndexReaders, which provide caching and reverse lookup capabilities for shards. An IndexReader is always bound to the Lucene segment from which it was started, meaning it has to be refreshed in order to see new changes. Therefore, a search only sees a change if the associated IndexReader was refreshed after that change occurred. By default, this is done once per second, but it can be reconfigured to occur more or less frequently.
Every replica shard is updated synchronously with its primary, and always carries the same information. Therefore, in terms of consistency, it does not matter if the primary or a replica shard is accessed. In CrateDB, only the refresh of the IndexReader affects consistency.
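The difference between primary-key lookups and searches can be modelled with a toy shard that keeps a current row map alongside a snapshot used for searching. This is only a simplified sketch of the visibility rule described above: ToyShard and its methods are hypothetical, and a real IndexReader works on Lucene segments rather than on a copied dictionary.

```python
class ToyShard:
    """Toy shard: direct lookups see writes at once, searches need a refresh."""

    def __init__(self):
        self._rows = {}         # primary key -> row, always up to date
        self._reader_view = {}  # snapshot standing in for the IndexReader

    def write(self, pk, row):
        self._rows[pk] = dict(row)

    def get_by_pk(self, pk):
        """Read-after-write: always returns the most recent row."""
        return self._rows.get(pk)

    def refresh(self):
        """Rebuild the search snapshot so it includes all writes so far."""
        self._reader_view = {pk: dict(row) for pk, row in self._rows.items()}

    def search(self, predicate):
        """Search the last refreshed snapshot only."""
        return [row for row in self._reader_view.values() if predicate(row)]


if __name__ == "__main__":
    shard = ToyShard()
    shard.write(1, {"id": 1, "city": "Dornbirn"})
    print(shard.get_by_pk(1))                               # row is visible
    print(shard.search(lambda r: r["city"] == "Dornbirn"))  # empty before refresh
    shard.refresh()
    print(shard.search(lambda r: r["city"] == "Dornbirn"))  # row is visible
```

In this model, calling refresh on a timer plays the role of the configurable refresh interval.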
Atomicity and durability
CrateDB implements write-ahead logging (WAL) by means of a transaction log (translog):
- Operations on rows (which are internally stored in CrateDB as JSON documents) are atomic.
- Operations on rows are persisted to disk without having to issue a Lucene-commit for every write operation. When the translog gets flushed, all data is written to the persistent index storage of Lucene, and the translog gets cleared.
- In the case of an unclean shutdown of a shard, the transactions in the translog are replayed upon startup, to ensure that all executed operations are permanent.
- The translog is also directly transferred when a newly allocated replica initializes itself from the primary shard.
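The flush, replay, and replica-initialization steps listed above can be sketched with a toy model. It is a generic write-ahead-log illustration, not CrateDB's storage code: ToyShardStore and its attributes are hypothetical, plain Python containers stand in for the on-disk Lucene index and translog, and crash() simply discards the in-memory state.

```python
class ToyShardStore:
    """Toy write-ahead log: a durable index and translog, volatile memory."""

    def __init__(self):
        self.index = {}     # stands in for the persistent Lucene index
        self.translog = []  # stands in for the on-disk translog
        self.memory = {}    # in-memory state, lost on an unclean shutdown

    def write(self, key, value):
        self.translog.append((key, value))  # log the operation first
        self.memory[key] = value            # then apply it in memory

    def flush(self):
        """Move logged operations into the index and clear the translog."""
        for key, value in self.translog:
            self.index[key] = value
        self.translog.clear()

    def crash(self):
        """Simulate an unclean shutdown: only durable state survives."""
        self.memory = {}

    def recover(self):
        """Rebuild memory from the index, then replay the translog."""
        self.memory = dict(self.index)
        for key, value in self.translog:
            self.memory[key] = value

    def init_replica(self):
        """Create a replica from copies of the index and the translog."""
        replica = ToyShardStore()
        replica.index = dict(self.index)
        replica.translog = list(self.translog)
        replica.recover()
        return replica


if __name__ == "__main__":
    store = ToyShardStore()
    store.write("a", 1)
    store.flush()
    store.write("b", 2)  # logged but not yet flushed
    store.crash()
    store.recover()
    print(store.memory)  # both writes survive the crash
```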
References
- Franz Rüf, Clemens Peter, Jodok Batlogg, Roland Alton-Scheidl (eds.): Open Source Initiative Vorarlberg. Perspektiven für Wirtschaft, Bildung und Verwaltung, 2005.
- "CrateDB packs NoSQL flexibility, SQL familiarity". InfoWorld. 2016-12-19.
- "Open Source Data Store Startup Crate Data Raises $1.5M From Sunstone And DFJ Esprit". TechCrunch. Retrieved 2021-01-13.
- "Vorarlberger Startup "Crate Data" ausgezeichnet". vol.at. Retrieved 2021-01-13.
- "Crate Data: Vorarlberger gewinnen bei Techcrunch Europe". Horizont.at.
- FinSMEs (2016-03-15). "Crate Technology Raises $4M in Funding". FinSMEs. Retrieved 2021-01-13.
- Claburn, Thomas. "Crate.io unboxes clustered SQL CrateDB, decamps to California". The Register. Retrieved 2021-01-13.
- Kepes, Ben (2016-12-14). "CrateDB: The IoT and machine data-focused database". Network World. Retrieved 2021-01-13.
- Yegulalp, Serdar (2017-05-16). "CrateDB 2.0 Enterprise stresses security and monitoring—and open source". InfoWorld. Retrieved 2021-01-13.
- "Crate.io Packs New Features, Services Into DB Upgrade". LinuxInsider. 2017-05-17. Retrieved 2021-01-13.
- "With version 2.0, Crate.io's database tools put an emphasis on IoT". TechCrunch. Retrieved 2021-01-13.
- FinSMEs (2017-01-02). "Crate.io Raises €2.5M in Seed Funding". FinSMEs. Retrieved 2021-01-13.
- "Version 4.0.0 — CrateDB: Reference". crate.io. Retrieved 2021-01-13.
- Crate.io (2020-09-09). "Eva Schönleitner Joins Crate.io as CEO". GlobeNewswire News Room. Retrieved 2021-01-13.