Is it really as great as they say it is?
Distributed systems, the CAP theorem, NoSQL (Mongo): promise and delivery.
Tomasz Borek, JAP head and leader since 2016, Mongo certified in 2012
New Mongo releases sparked another wave of interest and inquiries. Among my customers, several decided to use MongoDB for their applications. Some asked for my advice. Hence the research which led to this talk.
Mongo is nice and has nice documentation. Small projects will like it. Larger projects or projects which skimped on research may find themselves scrambling for PostgreSQL later on.
Mongo’s marketing is VERY persuasive and powerful. Do your own legwork or you and your project may suffer. Employ Jepsen and study its analyses.
Mongo is VERY popular. Widely popular. The M in MEAN is for Mongo. Mongo was among the first NoSQL DBs supported by Spring Data (if not the very first!). It’s incredibly easy to set up. Mongo marketing needs the popularity and drives it.
Mongo Express Angular Node
The real reason to learn MEAN stack: Employability
While there are drivers for querying it from many languages, Mongo (written in C++, JS and Python) is JS-tied (or perhaps front-end / web-tied)
Mongo shell is written in JS and uses JS
Mongo uses JSON / BSON
MEAN
NoSQL == hype
Distributed == Microservices == Hype
Mongo User Group talks
Blog posts (some from Mongo employees)
Hackathons
MEAN (web-tied)
Easy to install, dockerized… need I say more?
had a global lock crippling intense multithreaded workloads (still has, but…)
had the "safe saves" feature hidden and turned off by default
loses data sometimes
even lost data during READ!
doesn’t pass Jepsen tests
doesn’t support relations (almost) at all
doesn’t offer full ACID transactions
has non-isolated transactions and is plagued by anomalies
has unsafe defaults compromising everything for marketability
had a security disaster with 30k DBs being taken over / ransomed
requires the working set to fit in RAM (or it crawls)
may roll back your data unless you take great pains not to
actually discourages arbiters, cause sharding and 'reasons'
has transaction defaults that override the settings on a collection or DB?
I could go on. The list of surprises isn’t short, when it comes to Mongo.
do you even need that kind of scale?
what have you tried with RDBMSes so far?
do you need active-active?
do you need distributed transactions?
why are you after Mongo?
audits: code, infra, components, systems
tests and audits of performance or security
diving into DBs, GNU/Linuxes, net or security
programmer for hire
talks, workshops, consulting, trainings
November - February
9h of lectures a week, Mon-Wed-Fri 8-11
then project, Academy or just a lot of knowledge and some experience - depending on your results
mainly Kraków or Gdańsk
https://epa.ms/PreAcademy - apply now
https://epa.ms/subPreJAP - subscribe to know more later (set location if you want to)
And what was offered? To understand, we need to delve a bit into what is what.
NoSQL and CAP and Mongo
Distributed systems, scaling, transactions and Mongo
Relational DBs had decades to get to the point they have reached. They postulated 3rd normal form, holding data in the DB, relationships (so, math: algebra) and indexes (and many more, but I’ll stop here). And they continued for years with two sentences: "you don’t want active-active clusters" and "you don’t want distributed transactions". Along came NoSQL…
Take your pick:
no SQL, none, nada, zilch
not only SQL, cause we have polyglot persistence
no idea actually
new SQL
no, SQL
once upon a time the systems were divided
you cannot ignore partition tolerance
and so, the schism began: you were either CP (Mongo) or AP… or you were Even Worse™ - CA!
no relational DBs (or fewer, or we can work with ’em)
distributed, active-active
scalable (in, out, up, sideways, you name it)
big data
easy
Whatever they wrote on the tin. So… anything. Everything.
Mongo is CP, so consistency - yet watch out for how they define it (also, wait a few slides)
Mongo is for big data, cause hu-MONGO-us - but it’s also RAM-limited and sharding is only now being worked out
Mongo is good for financial things - until 3 busts with bitcoin startups
"we live in post-transactional era" - no transactions - oh wait, we don’t. So local, not for sharding. Oh wait, FOR sharding BUT with limits…
replication relies on having a backup copy, ensuring you don’t lose data
Mongo may lose data in replication, due to 'roll-backs', crazy defaults, election problems
sharding is a technique of horizontal partitioning, good to scale
if you use sharding, some things won’t work in Mongo or will be limited
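To make the sharding limits a bit more concrete, here is a minimal sketch of hash-based routing in plain Python. It is not Mongo's implementation: the shard names, the `user_id` shard key and both helper functions are made up for illustration. It only shows why a query that includes the shard key can hit a single shard, while any other query has to ask every shard.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["shard-a", "shard-b", "shard-c"]


def shard_for(key):
    """Pick a shard from a stable hash of the shard key value."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def shards_to_query(filter_doc, shard_key="user_id"):
    """Return the shards a query must visit.

    If the filter contains the shard key, the query is targeted at one shard.
    Otherwise it is a scatter-gather over all shards.
    """
    if shard_key in filter_doc:
        return [shard_for(filter_doc[shard_key])]
    return list(SHARDS)


targeted = shards_to_query({"user_id": 42})
scattered = shards_to_query({"email": "someone@example.com"})
```

Here `targeted` holds exactly one shard and `scattered` holds all three. Anything that spans shards (uniqueness checks, joins, transactions) has to pay for that fan-out somehow, and that is where the limits come from.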
Distributed ain’t easy!
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable. (Leslie Lamport)
you have a cluster? you have a distributed system
add network problems
add clustering problems
add your own system problems (your system tries to do things, right?)
ACID
MVCC
Isolation levels
cluster == distributed system == network+clustering+your system problems
you also have transactions? try distributed MVCC or distributed ATOMICity, with non-zero latency, or a changed route
add rollback, distributed one, you partially applied the things, now you roll them back, from entire cluster, with replication, oh, and your network has just partitioned
active-passive problems
add multiple concurrent/simultaneous writes - everybody accepts writes now
do you want your fries, ahem, transactions with that?
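A toy illustration of the "everybody accepts writes now" problem, in plain Python. The `Node` class and the last-write-wins merge are assumptions picked for the example. They are one simple conflict strategy, not a description of how Mongo or any specific DB reconciles writes.

```python
class Node:
    """A toy active node: accepts writes locally, syncs later."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)


def merge_last_write_wins(a, b):
    """Reconcile two nodes by keeping the entry with the newest timestamp."""
    for key in set(a.data) | set(b.data):
        candidates = [n.data[key] for n in (a, b) if key in n.data]
        winner = max(candidates)  # tuples compare by timestamp first
        a.data[key] = winner
        b.data[key] = winner


east, west = Node("east"), Node("west")
east.write("balance", 100, timestamp=1)
west.write("balance", 100, timestamp=1)

# Two clients read 100 and add money concurrently, each on a different node.
east.write("balance", 100 + 50, timestamp=2)
west.write("balance", 100 + 30, timestamp=3)

merge_last_write_wins(east, west)
final_balance = east.data["balance"][1]
```

Both nodes end up agreeing on 130, while the "correct" answer would be 180: the +50 is silently gone. Fixing that takes coordination, which is exactly what distributed transactions try to provide, at a price.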
Consistency, Availability, Partition Tolerance, you can’t sacrifice the latter
A fantastic piece of engineering, a tool to check if a distributed system actually handles itself well when a partition happens, under load, etc. Shout out to Kyle for his incredible tool, and please head over to jepsen.io.
Tests systems' behaviour, especially when a network partition occurs.
Tests partition tolerance.
And tests what happens when a partition HEALS.
That’s where the devil is, right there, in the details.
active passive(s) arbiter (replication)
active passives (replication)
sharded replica sets, one replica set for a shard (I’m skipping config here for simplicity) - that’s for big data / scaling
a partition happened
and the primary went down
and it healed in the middle of an election
or right as it finished with a new primary
and the partition healed, showing the other part of the network also elected a primary?
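The questions above can be played through with a very simplified model in plain Python. The function names and the list-of-strings "log" are hypothetical, and real elections and rollbacks have far more moving parts. The sketch only shows two ideas: a side of a partition needs a strict majority of voters to elect a primary, and writes that only the old primary saw get rolled back once it rejoins.

```python
def can_elect_primary(reachable_voters, total_voters):
    """A side of a partition may elect a primary only with a strict majority."""
    return reachable_voters > total_voters // 2


def heal(old_primary_log, new_primary_log):
    """Rejoin the old primary after the partition heals.

    The new primary's log wins. Everything the old primary has after the
    common prefix is rolled back. Returns (surviving_log, rolled_back_writes).
    """
    common = 0
    for mine, theirs in zip(old_primary_log, new_primary_log):
        if mine != theirs:
            break
        common += 1
    return list(new_primary_log), old_primary_log[common:]


# Three voting members; the partition leaves the old primary on its own.
minority_can_elect = can_elect_primary(1, 3)
majority_can_elect = can_elect_primary(2, 3)

# "w3" was acknowledged by the old primary only, never replicated.
old_log = ["w1", "w2", "w3"]
# The other side elected a new primary, which then accepted "w4".
new_log = ["w1", "w2", "w4"]
surviving, rolled_back = heal(old_log, new_log)
```

In this run `rolled_back` is `["w3"]`: a write the client believed was saved. Note that an even split (2 of 4 voters on each side) elects nobody in this model, which is one reason odd member counts are the usual advice.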
MongoDB is nice and their marketing promises the sky and beyond. It is VERY popular.
Many have failed, but this scarcely left a dent on Mongo sites or in its materials.
Be VERY careful!
Distributed systems are VERY HARD, transactions are hard
RDBMS makers didn’t want to do active-active clusters or distributed transactions or massive scaling, which NoSQL promised
NoSQL is not very well defined, and some definitions have definitely overpromised
CAP theorem has been revised since its inception - original division being too strict (and no latency?!)
Mongo promises to be good for all use cases
Read the small print in their docs (go deeper)
Read and understand Jepsen analyses
Do your own tests - or pay Jepsen to
Consider unusable scenarios (transactions? low RAM?)
NoSQL is now "Not Only SQL" - consider scenarios for chosen technology
CAP theorem was revised - are you big data? When are you CA or AP or CP?
Replication is difficult in active-active, do you need it?
Sharding is even more so
Distributed transactions are hard and dangerous
We do move forward, and the NoSQL DBs now are much better than their first generation. But do read the fine print and do use them per their scenarios. Also, consider LOTSA tests for your use case, using things outlined here as an idea generator. :)
Mongo on its own is maturing, undeniably. Version 5.0 is much better than 2.0.
Also, each time Jepsen tests Mongo, their docs get better and they correct something.
Thank you, TJB out. I’ll gladly take badges if you liked this… or emails / chats if you did not. :-)