Designing Data-Intensive Applications | Chapter 02. Data Models and Query Languages

Tags:

Presenters: Jungsoo Kim, Sumin Park

Presentation materials

Chapter goals

Understand the differences and characteristics of each data model.
Be able to decide which data model is appropriate for the application you want to build.

RDB vs NoSQL

Chapter 2 focuses only on the differences between data models.
- Other topics: fault tolerance in Chapter 5 and concurrency control in Chapter 7
NoSQL
- Schema flexibility and better performance due to locality
- For some applications, it is closer to the data structures used by the application
RDB
- Good support for joins, N:1 relationships, and N:N relationships

NoSQL

Problems with relational techniques that split document-like structures across multiple tables:

Difficult schemas
Unnecessary and complex application code

Limitations of the document model

Nested items inside a document cannot be referenced directly. In a relational model, related data can be referenced directly if the key exists, although unnecessary traversal may be required to find the key.
- Example: referencing a user’s detailed address
  - user.address.road.detail
- If nesting is not too deep, this is generally not a problem.
Weak join support
- Depending on the application, this may or may not be a problem.
Inefficient when many-to-many relationships are required
- Application complexity increases.
- Handling it in application code performs worse than specialized code inside the database.

Tip

When there are many interconnections: graph model > relational model >>> document model

Schema flexibility

It does not force a specific schema.

Arbitrary keys and values can be added to a document.
The existence of fields included in a document is not guaranteed.

Tip

It does not force a schema, but that does not mean there is no schema. - Relational model = schema-on-write - The data structure is explicit. - The database guarantees that all data follows the schema. - Document model = schema-on-read - The data structure is implicit. - It is interpreted only when the data is read. There are no constraints when writing.

Differences in processing by approach:

Example: adding first_name to user

// Document model
// You only need code that handles the case when the application reads data.
// (This method does not require a database change, but it does not seem like a good method. Treat it only as a way to understand the difference in approach.)
if (user?.name && user?.first_name == null) {
    user.first_name = user.name.split(" ")[0]
}
 
 
// Relational model
// Separate schema change and data migration work is required.
ALTER TABLE users ADD COLUMN first_name test;
UPDATE users SET first_name = substring_index(name, ' ', 2);

Tip

Schema changes are slow and require downtime. - As an exception, schema changes in MySQL have very poor performance.

Tip

When adding a column, all existing records have no data for that column, so: 1. First, make it nullable so it can contain null data. 2. Update the data. If the database is large, the update may take a long time. 3. Change it back to the original attribute, nullable or non-null.

Storage locality

Locality: the property of intensively accessing a specific part of data, a program, and so on.

If you frequently need to access an entire document, you can take advantage of storage locality.

A denormalized document model structure has a performance advantage over a normalized relational model structure.
- A relational model needs multiple index lookups for retrieval, which means more disk seeks and more time.
This applies only when you need many parts of the document at once.
- In a document model, the whole document must be stored even when only a small part is accessed, so large documents can be wasteful.

In general, it is recommended to keep documents as small as possible and minimize growth in document size.

Because of these performance limitations, the situations where the document model is useful become much narrower.

Tip

The relational model also has concepts where locality is applied. - Spanner, Google's database: interleaved tables - https://cloud.google.com/spanner/docs/schema-and-data-model#creating-interleaved-tables - Oracle: multi-table index cluster tables - Cassandra and HBase: column families

Convergence of RDB and NoSQL

This means they provide similar features.

RDB

Most RDBs except MySQL support XML, so they can use a data model very similar to the document model.
PostgreSQL 9.3+, MySQL 5.7+, and DB2 10.5+ provide similar features for JSON documents.

NoSQL

RethinkDB: supports relational joins in queries.
MongoDB: the driver automatically resolves database references. Joins are performed on the client, require additional network round trips, are less optimized, and have poor join performance.

Query Language

Declarative vs imperative

SQL is declarative.
IMS and CODASYL are imperative, but not important here.

Advantages of declarative query languages:

You only need to specify the pattern of the data you want to know. Which indexes and join functions to use, and in which order to execute them, are tasks for the database query optimizer.
- Conditions that must be satisfied
- Data transformations such as sorting, grouping, and aggregation
They are generally more concise and easier to work with than imperative query APIs.
Because the detailed implementation of the database engine is abstracted away, database performance can be improved without changing queries.

Web example:

When applying style to HTML:

CSS is declarative.
Using the DOM API with JS is imperative.
Declarative style is better than imperative style in readability, productivity, and maintainability.

Tip

In databases as well, declarative query languages such as SQL are said to be much better than imperative query APIs.

MapReduce queries

Not important here.

They are based on the map and reduce functions from functional programming.

Disadvantages:

They are harder than writing a query.
Declarative query languages give the query optimizer opportunities to improve query performance.

MongoDB 2.2 added support for a declarative query language called the aggregate pipeline.

Graph-like data models

If an application’s data model is mostly 1:N, such as tree-structured data, or if there are no relationships between entities, the document model is appropriate.

However, if N:N relationships are very common, a graph model is appropriate.

Tip

The relational model is suitable up to simple N:N relationships, but it is not suitable for handling more complex N:N relationships because complexity increases.

Graph components

Vertex, also called a node or entity. In Spring Data, it is called a node entity with @NodeEntity. Edge, also called a relationship or arc.

Examples:

Social graph
- Vertex = person, edge = friendship
Web graph
- Vertex = web page, edge = link
Road network
- Vertex = intersection, edge = road
Facebook
- Maintains many types of vertices and edges in a single graph.
- Vertex = person, place, event, check-in, comment, and so on.
- Edge = relationship between people, location where a check-in occurred, who commented on which post, who attended which event, and so on.

What if Facebook were built with a relational model?

People, places, events, check-ins, comments, and so on would all be defined as tables.
Extremely complex relationships between those tables would be required.
The moment a graph model is applied, this complexity is simplified.

Types of graph models

Property graph model
Triple-store model, not important here

Declarative query languages for graphs

Cypher
SPARQL
Datalog

Property graph

Components of a vertex:

id
Set of outgoing edges
Set of incoming edges
Collection of properties, as key-value pairs

Components of an edge:

id
Tail vertex where the edge starts
Head vertex where the edge ends
Label describing the relationship type between the two vertices
Collection of properties, as key-value pairs

Tip

The direction of an edge is tail -> head, so the tail is the start and the head is the end.

Representing a property graph with a relational schema:

CREATE TABLE vertices (
    vertex_id integer PRIMARY KEY,
    properties json
)
 
 
CREATE TABLE edges (
    edge_id integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices(vertex_id),
    head_vertex integer REFERENCES vertices(vertex_id),
    label text,
    properties json
)
 
 
CREATE INDEX edges_tails ON edges(tail_vertex)
CREATE INDEX edges_heads ON edges(head_vertex)

Vertices are connected to other vertices by edges.
- There is no schema that limits which types are related.
Given a vertex, its incoming and outgoing edges can be found efficiently, and the graph can be traversed.
By using different labels for different types of relationships, different types of information can be stored in a single graph while keeping the data model clean.

Through these features, graphs provide a great deal of flexibility for data modeling.

Graphs are highly evolvable, so when features are added to an application, the graph can be extended easily to accommodate data structure changes.

Cypher

A declarative query language for property graphs.

It was created for the Neo4j graph database.
Wiki: Cypher is a declarative graph query language that allows efficient querying and updating of property graphs expressed as patterns. Cypher is relatively simple but very powerful. Very complex database queries can be expressed easily with Cypher.

Creating a data model:

CREATE
(NAmerica:Location {name:'North America', type:'continent'}),
(USA:Location {name:'United States', type:'country'}),
(Idaho:Location {name:'Idaho', type:'state'}),
(Lucy:Person {name:'Lucy'}),
(Idaho) -[:WITHIN]-> (USA) -[:WITHIN]-> (NAmerica),
(Lucy) -[:BORN_IN]-> (Idaho)

Problem: find the names of everyone who emigrated from the United States to Europe.

MATCH
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (USA:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (EU:Location {name:'Europe'})
RETURN person.name

How to execute the query:

Start by looking up all people, check their birthplaces and places of residence, and return only the matching people.
Start from two Location nodes, find all locations in the United States and Europe, and return people discovered through incoming BORN_IN and LIVES_IN edges on leaf vertices.

Advantages of declarative style:

You do not need to describe the execution method in detail.
The query optimizer automatically chooses the most efficient strategy.

What if this query were written in a relational model?

It is possible, but difficult.
In a relational model, the joins required for a query are usually known in advance and declared in the FROM clause.
In a graph query, however, multiple edges must be traversed to find the desired vertex.
- You do not know how many traversals are required.
- The number of joins cannot be fixed in advance.

In a Cypher query, traversal is expressed very concisely with -[:WITHIN*0..]->. *0 means zero or more times.

Since SQL:1999, queries for variable-length traversal paths can be expressed using recursive common table expressions, or Recursive CTEs, with the WITH RECURSIVE statement.

Supported by PostgreSQL, DB2, Oracle, and SQL Server.
Supported in MySQL 5.7+.

Triple stores and SPARQL

Not important here.

Almost the same as the property graph model.

Only the terminology for the same concepts differs.

Data is stored in very simple three-part statements: subject, predicate, and object.

Subject = vertex
Object = another vertex or primitive datatype data
Predicate = edge

Turtle

Wiki: Terse RDF Triple Language, or Turtle, is a syntax and file format for expressing data in the Resource Description Framework data model. Turtle syntax is similar to the syntax of SPARQL, an RDF query language.

SPARQL is a triple-store query language that uses the RDF data model.

Datalog

The data model is similar to the triple-store model.

Subject, predicate, object -> predicate(subject, object)

Graph database ranking

https://db-engines.com/en/ranking/graph+dbms

One-line summary: if you want to use a graph database, use Neo4j.

Neo4j

Supported by Spring Data.

https://spring.io/guides/gs/accessing-data-neo4j/

spring-boot-starter-data-neo4j

@NodeEntity
data class Food(
    @Id
    val id: Long? = null,
    val name: String,
}
 
 
interface FoodRepository : Neo4jRepository<Food, Long>
 
 
@NodeEntity
data class Store(
    @Id
    val id: Long? = null,
    val name: String,
    @Relationship(type = "has")
    val foods: Set<Food>,
}
 
 
interface StoreRepository : Neo4jRepository<Store, Long>

Summary

Historically, people tried to represent data as one large tree.
This was not appropriate for representing N:N relationships, so the relational model emerged.
More recently, people found that some applications are not well suited to the relational model, which led to the emergence of NoSQL, a non-relational data model.

NoSQL has two main branches.

Document model
- Used when all data is contained in documents and there are few relationships between documents.
Graph model
- Used when, in contrast to the document model, everything is potentially related.

All three models are widely used today.

One model can imitate another, but in most cases the result is messy.

Advantages of the document model and graph model:

They do not force a schema for stored data, so applications can be changed more easily as requirements change.