MongoDB

a sceptic’s view

Is it really as great as they say it is?
Distributed system, CAP theorem, NoSQL (Mongo) promise and delivery.

Tomasz Borek, JAP head and leader since 2016, Mongo certified in 2012

About this talk and myself

New Mongo releases sparked another wave of interests and inquiries. Among my customers, several decided to use Mongo DB for their applications. Some asked my advice. Hence, the research which led to this talk.

TL;DR

Mongo is nice and has nice documentation. Small projects will like it. Larger projects or projects which skimped on research may find themselves scrambling for PostgreSQL later on.

Mongo’s marketing is VERY persuasive and powerful. Do your own leg work or you and your project may suffer. Employ Jepsen and study it’s analyses.

MEAN

MEAN

Mongo Express Angular Node

The real reason to learn MEAN stack: Employability
— Free Code Camp Blog post from 2014

Languages

While there are drivers to do queries in multiple languages, Mongo (written in C++, JS and Python) is JS-tied (or perhaps front-end / web tied)

  1. Mongo shell is written in JS and uses JS

  2. Mongo uses JSON / BSON

  3. MEAN

Hype

  1. NoSQL == hype

  2. Distributed == Microservices == Hype

  3. Mongo User Group talks

  4. Blog posts (some from Mongo employees)

  5. Hackathons

  6. MEAN (web-tied)

  7. Easy to install, dockerized…​ need I say more?

Surprise! Did you know, that Mongo…​

  1. had global lock crippling intense multithreaded workflows (still has, but…​)

  2. had "safe saves" feature hidden and turned off by default

  3. loses data sometimes

    1. even lost data during READ!

  4. doesn’t pass Jepsen tests

  1. doesn’t support relations (almost) at all

  2. doesn’t offer full ACID transactions

  3. has non-isolated transactions and plagued by anomalies

  4. has unsafe defaults compromising everything for marketability

  5. had a security disaster with 30k DBs being taken over / ransomed

  1. requires for the working set to fit in RAM (or crawls)

  2. may roll-back your data unless you take great pains not to

  3. actually discourages arbiter cause sharding and 'reasons'

  4. has transaction defaults that override the settings on a collection or DB?

I could go on. The list of surprises isn’t short, when it comes to Mongo.

Do you really need it?

do you even need that kind of scale?
what have you tried with RDBMSes so far?

do you need active-active?
do you need distributed transactions?

why are you after Mongo?

About me

LAFK pl

In the Internet

LAFK pl wSieci

What I do

  1. audits: code, infra, components, systems

  2. tests and audits of performance or security

  3. diving into DBs, GNU/Linuxes, net or security

  4. programmer for hire

  5. talks, workshops, consulting, trainings

Junior Java Academy Online

  1. November - February

  2. 9h of lectures a week, Mon-Wed-Fri 8-11

  3. then project, Academy or just a lot of knowledge and some experience - depending on your results

  4. mainly Kraków or Gdańsk

https://epa.ms/subPreJAP - subscribe to know more later (set location if you want to)

Questions?

question mark

What was promised?

And what was offered? To understand, we need to delve a bit into what is what.

  1. NoSQL and CAP and Mongo

  2. Distributed systems, scaling, transactions and Mongo

Databases are hard

Relational DBs had decades to get to the point they have reached. They postulated 3rd normal form, holding data in the DB, relationships (so, math: algebra) and indexes (and many more but I’ll stop here). And they continued for years with two sentences: you don’t want active-active clusters" and "you don’t want distributed transactions". Along came NoSQL…​

What does NoSQL mean?

Take your pick:

  1. no SQL, none, nada, zilch

  2. not only SQL, cause we have polyglot persistence

  3. no idea actually

  4. new SQL

  5. no, SQL

800
Figure 1. pic from Mark Maddsen’s talk at StrataCon 2013

The original CAP theorem

once upon a time the systems were divided

you cannot ignore partition tolerance
— Wise Man

and so, the schism began: you were either CP (Mongo) or AP…​ or you were Even Worse™ - CA!

Database Systems according to the CAP Theorem
Figure 2. Credits: Security and Privacy Implications on Database Systems in Big Data Era: A Survey

The NoSQL promise

  1. no relational DBs (or less, or we can work with’em)

  2. distributed, active-active

  3. scalable (in, out, up, sideways, you name it)

  4. big data

  5. easy

Whatever they wrote on the tin. So…​ anything. Everything.

Mongo’s promise?

  1. Mongo is CP, so consistency - yet watch out for how they define it (also, wait a few slides)

  2. Mongo is for big data, cause hu-MONGO-us - but it’s also RAM-limited and sharding is only now being worked out

  3. Mongo is good for financial things - until 3 busts with bitcoin startups

  4. "we live in post-transactional era" - no transactions - oh wait, we don’t. So local, not for sharding. Oh wait, FOR sharding BUT with limits…​

When Mongo completes a write?

HackingDistributed EminGunSirer MongoCompletesWriteWhen
Figure 3. Credits: HackingDistributed, a blog of Emin Gün Sirer

Replication and sharding

  1. replication relies on having a backup copy, assuring you don’t lose data

    1. Mongo may lose data in replication, due to 'roll-backs', crazy defaults, election problems

  2. sharding is a technique of horizontal partitioning, good to scale

    1. if you use sharding, some things won’t work in Mongo or will be limited

Distributed ain’t easy!

Distributed systems are hard

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
— Leslie Lamport

Network Fallacies

800
Figure 4. Denise Yu illustrated the 8 Fallacies from Peter L. Deutsch and others

Active passive cluster

  1. you have a cluster? you have a distributed system

  2. add network problems

  3. add clustering problems

  4. add your own system problems (your system tries to do things, right?)

Transactions in SQL

ACID

MVCC

Isolation levels

Active passive cluster with transactions

  1. cluster == distributed system == network+clustering+your system problems

  2. you also have transactions? try distributed MVCC or distributed ATOMICity, with non-zero latency, or a changed route

  3. add rollback, distributed one, you partially applied the things, now you roll them back, from entire cluster, with replication, oh, and your network has just partitioned

You don’t want active - active

  1. active-passive problems

  2. add multiple concurrent/simultaneous writes - everybody accepts writes now

  3. do you want your fries, ekhem, transactions with that?

    CAP

    Consistency, Availability, Partition Tolerance, you can’t sacrifice the latter

Jepsen

A fantastic piece of engineering, a tool to check if a distributed system actually handles itself well when partition happens, under load, etc. Shout out to Kyle for his incredible tool and head over to jepsen.io please.

What does it do?

Tests systems' behaviour, especially when a network partition occurs.

Tests partition tolerance.

And tests what happens when a partition HEALS.
That’s where the devil is, right there, in the details.

Mongo clusters

  1. active passive(s) arbiter (replication)

  2. active passives (replication)

  3. sharded replica sets, one replica set for a shard (I’m skipping config here for simplicity) - that’s for big data / scaling

Again: when Mongo completes a write?

HackingDistributed EminGunSirer MongoCompletesWriteWhen
Figure 5. Credits: HackingDistributed, a blog of Emin Gün Sirer

Now what if…​

  1. a partition happened

  2. and the primary went down

  3. and it healed in the middle of an election

  4. or right as it finished with a new primary

  5. and partition healed, showing the other part of the network also elected a primary?

Questions?

question mark

Summarizing

MongoDB is nice and their marketing promises the sky and beyond. It is VERY popular.

Many have failed, but this scarcely left a dent on Mongo sites or in it’s materials.

Be VERY careful!

Overpromised

  1. Distributed systems are VERY HARD, transactions are hard

  2. RDBMS makers didn’t want to do active-active clusters or distributed transactions or massive scaling, which NoSQL promised

  3. NoSQL is not very well defined, some definitions definitely have overpromised

  4. CAP theorem has been revised since it’s inception - original division being too strict (and no latency?!)

Mongo? WARNING! WARNING! WARNING!

  1. Mongo promises to be good for all use cases

  2. Read the small print in their docs (go deeper)

  3. Read and understand Jepsen analyses

  4. Do your own tests - or pay Jepsen to

  5. Consider unusable scenarios (transactions? low RAM?)

NoSQL? Careful!

  1. NoSQL is now "Not Only SQL" - consider scenarios for chosen technology

  2. CAP theorem was revised - are you big data? When are you CA or AP or CP?

  3. Replication is difficult in active-active, do you need it?

  4. Sharding is even more so

  5. Distributed transactions are hard and dangerous

Silver lining?

We do move forward, and the NoSQL DB’s now are much better than their first generation. But do read the fine print and do use them per their scenarios. Also, consider LOTSA tests for your use case, using things outlined here as an idea generator. :)

Mongo on it’s own is maturing, undeniable. Version 5.0 is much better than 2.0.

Also, each time Jepsen tests Mongo, their docs get better and they correct something.

Thank you!

Thank you, TJB out. I’ll gladly take badges if you liked this…​ or emails / chats if you did not. :-)