Designing Data-Intensive Applications | Chapter 02. Data Models and Query Languages
Presenters: Jungsoo Kim, Sumin Park
Chapter goals
- Understand the differences and characteristics of each data model.
- Be able to decide which data model is appropriate for the application you want to build.
RDB vs NoSQL
- Chapter 2 focuses only on the differences between data models.
- Other topics: fault tolerance in Chapter 5 and concurrency control in Chapter 7
- NoSQL
- Schema flexibility and better performance due to locality
- For some applications, it is closer to the data structures used by the application
- RDB
- Good support for joins, N:1 relationships, and N:N relationships
NoSQL
Problems with relational techniques that split document-like structures across multiple tables:
- Difficult schemas
- Unnecessary and complex application code
Limitations of the document model
- Nested items inside a document cannot be referenced directly. In a relational model, related data can be referenced directly if the key exists, although unnecessary traversal may be required to find the key.
- Example: referencing a user’s detailed address
- user.address.road.detail
- If nesting is not too deep, this is generally not a problem.
- Example: referencing a user’s detailed address
- Weak join support
- Depending on the application, this may or may not be a problem.
- Inefficient when many-to-many relationships are required
- Application complexity increases.
- Handling it in application code performs worse than specialized code inside the database.
Schema flexibility
It does not force a specific schema.
- Arbitrary keys and values can be added to a document.
- The existence of fields included in a document is not guaranteed.
Differences in processing by approach:
Example: adding first_name to user
// Document model
// You only need code that handles the case when the application reads data.
// (This method does not require a database change, but it does not seem like a good method. Treat it only as a way to understand the difference in approach.)
if (user?.name && user?.first_name == null) {
user.first_name = user.name.split(" ")[0]
}
// Relational model
// Separate schema change and data migration work is required.
ALTER TABLE users ADD COLUMN first_name test;
UPDATE users SET first_name = substring_index(name, ' ', 2);
Storage locality
Locality: the property of intensively accessing a specific part of data, a program, and so on.
If you frequently need to access an entire document, you can take advantage of storage locality.
- A denormalized document model structure has a performance advantage over a normalized relational model structure.
- A relational model needs multiple index lookups for retrieval, which means more disk seeks and more time.
- This applies only when you need many parts of the document at once.
- In a document model, the whole document must be stored even when only a small part is accessed, so large documents can be wasteful.
In general, it is recommended to keep documents as small as possible and minimize growth in document size.
- Because of these performance limitations, the situations where the document model is useful become much narrower.
Convergence of RDB and NoSQL
This means they provide similar features.
RDB
- Most RDBs except MySQL support XML, so they can use a data model very similar to the document model.
- PostgreSQL 9.3+, MySQL 5.7+, and DB2 10.5+ provide similar features for JSON documents.
NoSQL
- RethinkDB: supports relational joins in queries.
- MongoDB: the driver automatically resolves database references. Joins are performed on the client, require additional network round trips, are less optimized, and have poor join performance.
Query Language
Declarative vs imperative
- SQL is declarative.
- IMS and CODASYL are imperative, but not important here.
Advantages of declarative query languages:
- You only need to specify the pattern of the data you want to know. Which indexes and join functions to use, and in which order to execute them, are tasks for the database query optimizer.
- Conditions that must be satisfied
- Data transformations such as sorting, grouping, and aggregation
- They are generally more concise and easier to work with than imperative query APIs.
- Because the detailed implementation of the database engine is abstracted away, database performance can be improved without changing queries.
Web example:
When applying style to HTML:
- CSS is declarative.
- Using the DOM API with JS is imperative.
- Declarative style is better than imperative style in readability, productivity, and maintainability.
MapReduce queries
Not important here.
They are based on the map and reduce functions from functional programming.
Disadvantages:
- They are harder than writing a query.
- Declarative query languages give the query optimizer opportunities to improve query performance.
MongoDB 2.2 added support for a declarative query language called the aggregate pipeline.
Graph-like data models
If an application’s data model is mostly 1:N, such as tree-structured data, or if there are no relationships between entities, the document model is appropriate.
However, if N:N relationships are very common, a graph model is appropriate.
Graph components
Vertex, also called a node or entity. In Spring Data, it is called a node entity with @NodeEntity. Edge, also called a relationship or arc.
Examples:
- Social graph
- Vertex = person, edge = friendship
- Web graph
- Vertex = web page, edge = link
- Road network
- Vertex = intersection, edge = road
- Facebook
- Maintains many types of vertices and edges in a single graph.
- Vertex = person, place, event, check-in, comment, and so on.
- Edge = relationship between people, location where a check-in occurred, who commented on which post, who attended which event, and so on.
What if Facebook were built with a relational model?
- People, places, events, check-ins, comments, and so on would all be defined as tables.
- Extremely complex relationships between those tables would be required.
- The moment a graph model is applied, this complexity is simplified.
Types of graph models
- Property graph model
- Triple-store model, not important here
Declarative query languages for graphs
- Cypher
- SPARQL
- Datalog
Property graph
Components of a vertex:
- id
- Set of outgoing edges
- Set of incoming edges
- Collection of properties, as key-value pairs
Components of an edge:
- id
- Tail vertex where the edge starts
- Head vertex where the edge ends
- Label describing the relationship type between the two vertices
- Collection of properties, as key-value pairs
Representing a property graph with a relational schema:
CREATE TABLE vertices (
vertex_id integer PRIMARY KEY,
properties json
)
CREATE TABLE edges (
edge_id integer PRIMARY KEY,
tail_vertex integer REFERENCES vertices(vertex_id),
head_vertex integer REFERENCES vertices(vertex_id),
label text,
properties json
)
CREATE INDEX edges_tails ON edges(tail_vertex)
CREATE INDEX edges_heads ON edges(head_vertex)
- Vertices are connected to other vertices by edges.
- There is no schema that limits which types are related.
- Given a vertex, its incoming and outgoing edges can be found efficiently, and the graph can be traversed.
- By using different labels for different types of relationships, different types of information can be stored in a single graph while keeping the data model clean.
Through these features, graphs provide a great deal of flexibility for data modeling.
Graphs are highly evolvable, so when features are added to an application, the graph can be extended easily to accommodate data structure changes.
Cypher
A declarative query language for property graphs.
- It was created for the Neo4j graph database.
- Wiki: Cypher is a declarative graph query language that allows efficient querying and updating of property graphs expressed as patterns. Cypher is relatively simple but very powerful. Very complex database queries can be expressed easily with Cypher.
Creating a data model:
CREATE
(NAmerica:Location {name:'North America', type:'continent'}),
(USA:Location {name:'United States', type:'country'}),
(Idaho:Location {name:'Idaho', type:'state'}),
(Lucy:Person {name:'Lucy'}),
(Idaho) -[:WITHIN]-> (USA) -[:WITHIN]-> (NAmerica),
(Lucy) -[:BORN_IN]-> (Idaho)
Problem: find the names of everyone who emigrated from the United States to Europe.
MATCH
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (USA:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (EU:Location {name:'Europe'})
RETURN person.name
How to execute the query:
- Start by looking up all people, check their birthplaces and places of residence, and return only the matching people.
- Start from two Location nodes, find all locations in the United States and Europe, and return people discovered through incoming BORN_IN and LIVES_IN edges on leaf vertices.
Advantages of declarative style:
- You do not need to describe the execution method in detail.
- The query optimizer automatically chooses the most efficient strategy.
What if this query were written in a relational model?
- It is possible, but difficult.
- In a relational model, the joins required for a query are usually known in advance and declared in the FROM clause.
- In a graph query, however, multiple edges must be traversed to find the desired vertex.
- You do not know how many traversals are required.
- The number of joins cannot be fixed in advance.
In a Cypher query, traversal is expressed very concisely with -[:WITHIN*0..]->. *0 means zero or more times.
Since SQL:1999, queries for variable-length traversal paths can be expressed using recursive common table expressions, or Recursive CTEs, with the WITH RECURSIVE statement.
- Supported by PostgreSQL, DB2, Oracle, and SQL Server.
- Supported in MySQL 5.7+.
Triple stores and SPARQL
Not important here.
Almost the same as the property graph model.
- Only the terminology for the same concepts differs.
Data is stored in very simple three-part statements: subject, predicate, and object.
- Subject = vertex
- Object = another vertex or primitive datatype data
- Predicate = edge
Turtle
- Wiki: Terse RDF Triple Language, or Turtle, is a syntax and file format for expressing data in the Resource Description Framework data model. Turtle syntax is similar to the syntax of SPARQL, an RDF query language.
SPARQL is a triple-store query language that uses the RDF data model.
Datalog
The data model is similar to the triple-store model.
Subject, predicate, object -> predicate(subject, object)
Graph database ranking
https://db-engines.com/en/ranking/graph+dbms
One-line summary: if you want to use a graph database, use Neo4j.
Neo4j
Supported by Spring Data.
https://spring.io/guides/gs/accessing-data-neo4j/
spring-boot-starter-data-neo4j
@NodeEntity
data class Food(
@Id
val id: Long? = null,
val name: String,
}
interface FoodRepository : Neo4jRepository<Food, Long>
@NodeEntity
data class Store(
@Id
val id: Long? = null,
val name: String,
@Relationship(type = "has")
val foods: Set<Food>,
}
interface StoreRepository : Neo4jRepository<Store, Long>
Summary
- Historically, people tried to represent data as one large tree.
- This was not appropriate for representing N:N relationships, so the relational model emerged.
- More recently, people found that some applications are not well suited to the relational model, which led to the emergence of NoSQL, a non-relational data model.
NoSQL has two main branches.
- Document model
- Used when all data is contained in documents and there are few relationships between documents.
- Graph model
- Used when, in contrast to the document model, everything is potentially related.
All three models are widely used today.
- One model can imitate another, but in most cases the result is messy.
Advantages of the document model and graph model:
- They do not force a schema for stored data, so applications can be changed more easily as requirements change.