Java Memory Model with mechanical sympathy

logo JDD

Tomasz Borek, @LAFK_pl

Why mechanical sympathy?

The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.
Henry Petroski
— an American engineer specializing in failure analysis

Hardware: orders of magnitude

  1. 1971, 4004: 4b, 640B of mem, 60KOPS

  2. 1978, 8086: 16b, 1MB of mem, 5-10MHz

  3. 2017: Intel Core i9: x86-64 so 64b, 10 cores, 13.2MB L3, 3.3GHz

  4. 2017: AMD Ryzen Threadripper: x86-64, 16 cores, 32MB L3, 3.4GHz

Data after ComputerHope.

By comparison, software

there is no single development, in either technology or management technique, which by itself promises even one order of magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity.
Fred Brooks
— No Silver Bullet – Essence and Accident in Software Engineering

Yes, high-level languages (Ada, Python, Java…​), reuse, methodologies (Waterfall, Agile, Scrum), build tools, paradigms (OOP, FP), testing (TDD, BDD, ATDD), architectures (monolith, µservices, DDD)…​ but still.

Do we have something like Moore’s law? AFAIK, no.

Can multithreading be the silver bullet?

  1. Firstly, it’s hardware-based (cores)

  2. Amdahl says: speedup is capped by the serial part of your program

  3. USL says: Amdahl forgot about contention and coherency (communication) overhead - see the sketch after this list

  4. but if we use it well and choose the problem/tooling well…​
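For reference, a sketch of the math behind points 2 and 3 (symbol names are mine): Amdahl's law with parallelizable fraction p and N executors, and Gunther's Universal Scalability Law with contention coefficient alpha and coherency (communication) coefficient beta.

Amdahl: S(N) = 1 / ((1 - p) + p/N)                         - speedup is capped at 1/(1 - p) no matter how many cores you add
USL:    C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))    - with beta > 0, throughput eventually drops as N grows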

By design

There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
Sir C.A.R. Hoare
— QuickSort | null | CSP | monitor | Hoare logic creator

Mechanical sympathy

You don’t have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy.
Jackie Stewart
— racing driver
Mechanical sympathy

knowing your hardware platform and applying that knowledge.

Eco-developer manifesto

Not only well-crafted software but also efficient software
Not only steadily adding value but also benchmarking
Not only community of professionals but also mechanical sympathizers
Not only productive partnerships but also reducing CO2 emissions

by Tomasz "@LAFK_pl" Borek in hopes of reducing global warming

About me


Agenda

  1. Current situation and mechanical sympathy

  2. Multithreading for humans

  3. Hardware

  4. Java

  5. Summary and further reading

No time today

  1. external and some internal concurrency frameworks

  2. JIT, jcmd tool, Unsafe

  3. examples - they are only linked for now

They will appear later - this is the 1st iteration.

Technicalities - Java

  1. unless stated otherwise: OpenJDK 11, the LTS one.

  2. JLS for JDK 11, chapter 17 - Threads and Locks

    JCIP

    Java Concurrency in Practice, as depicted on https://jcip.net

Technicalities - OS

> inxi -S
System:    Host: lafk-T470 Kernel: 4.15.0-64-generic x86_64 bits: 64 Desktop: Xfce 4.12.3
           Distro: Ubuntu 18.04.3 LTS

Technicalities - box

lstopo result depicts Intel CPU

lstopo from hwloc package

Multiple

Singular read or write means one (1) read or write operation, so one (1) access to a given memory location.

Reads or writes are plural - meaning multiple reads, multiple writes and multiple memory accesses (and I may stop typing multiple from now on, yay!).

Questions?


Hardware primer

lstopo result depicts Intel CPU

Intel mainly

Problem #0: Memory accesses conflict

  1. a memory location can be read or written

  2. two accesses to the same memory location conflict if at least one of them is a write

Problem #1: Data Race

  1. Multiple threads are executed by multiple executors at the same time.

  2. reads and writes happen from/to the same location (i.e. RAM)

  3. in other words: memory accesses conflict and happen simultaneously

Problem #1: Data Race

memory accesses conflict and happen simultaneously?

Non-deterministic outcomes on current hardware, AFAIK.
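A minimal Java sketch of such a race (class and field names are mine): two threads do an unsynchronized read-modify-write on the same field, the accesses conflict and the result is non-deterministic.

class RacyCounter {
    static int count = 0; // plain shared field, no synchronization

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                count++; // read-modify-write: interleaved threads lose updates
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(count); // expected 200000, usually prints less
    }
}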

Problem #2: Visibility

  1. Multiple threads are executed by multiple executors at the same time.

  2. Writes and reads happen to RAM or caches.

  3. Thread actions interleave; usually there is more than one possible execution order.

  4. Classic counter problem
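A minimal sketch of the visibility flavour (names are mine): without volatile or synchronization the reader may never observe the writer's update; whether it does is JVM- and platform-dependent.

class VisibilityDemo {
    static boolean stop = false; // not volatile: no guarantee the reader ever sees the write

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // busy-wait; the JIT may hoist the read and spin here forever
            }
            System.out.println("reader saw stop == true");
        });
        reader.start();
        Thread.sleep(100);
        stop = true;    // the write may never become visible to the reader
        reader.join();  // may hang without volatile or synchronization
    }
}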

Problem #3: False sharing

  1. Multiple threads are executed by multiple executors at the same time.

  2. Memory is far away, caches are close at hand, so data goes through the caches

  3. Caches operate on LINES (typically 64 bytes)

  4. Data shorter than a cache line: neighbouring data shares the line with it

  5. Data longer than a cache line: it gets split across lines
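A minimal false-sharing sketch (class and field names are mine): two independent counters that typically land on the same 64-byte cache line, each hammered by its own thread, so the cores keep invalidating each other's copy of the line. Padding the fields apart (or the JDK-internal @Contended annotation, which needs -XX:-RestrictContended) usually makes this run noticeably faster.

class FalseSharingDemo {
    // both counters will usually share one cache line; volatile forces a real write each iteration
    volatile long counterA = 0;
    volatile long counterB = 0;

    public static void main(String[] args) throws InterruptedException {
        FalseSharingDemo demo = new FalseSharingDemo();
        Thread a = new Thread(() -> { for (long i = 0; i < 50_000_000L; i++) demo.counterA++; });
        Thread b = new Thread(() -> { for (long i = 0; i < 50_000_000L; i++) demo.counterB++; });
        long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}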

Messy cache coherency: MESI

  1. Modified: somebody changed it, flush to bus or keep going

  2. Exclusive: only I hold it, unless the bus says otherwise

  3. Shared: not Exclusive! Lots of bus reads. Bus writes?

  4. Invalid: this is where bus writes land. A CPU write takes us to Modified.

Modern CPUs use newer variants with additional states (MESIF, MOESI).

Code optimizations

  1. Everybody wants speed; code optimizations are a way to get it

  2. Imperative code is natural for CPU (loop unrolling, function inlining)

  3. Caches are faster than RAM (prefetching, read/write ahead)

  4. Out-of-Order Execution (OoOE), speculation and branch prediction

  5. This may break the program

Code optimizations break the program how

  1. illegal order of statements: thread ends before it begins?

  2. Spectre (mitigations, not fixes)

  3. confuses the programmers (happens behind the scenes)

Problem #4: code optimizations

  1. Multiple threads are executed by multiple executors at the same time.

  2. Writes and reads happen to RAM or caches, e.g. L2.

  3. Statements in code can be rearranged for greater efficiency.

  4. Rearrangements can be done by compiler or CPU.

  5. Sometimes they can be illegal (thread ends before it begins?)
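A sketch of the classic reordering litmus test (names are mine; jcstress is the proper tool for such experiments): with plain fields, observing r1 == 0 and r2 == 0 in the same run is legal, and is only explainable by the compiler or the CPU reordering the statements.

class ReorderingDemo {
    static int x, y, r1, r2;

    public static void main(String[] args) throws InterruptedException {
        for (int run = 0; run < 1_000_000; run++) {
            x = 0; y = 0; r1 = 0; r2 = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();
            if (r1 == 0 && r2 == 0) { // neither thread saw the other's write: reordered
                System.out.println("reordering observed in run " + run);
                break;
            }
        }
    }
}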

Memory models - solution to P#4

  1. have rules to distinguish illegal from legal

  2. order the events so the dev is less/not confused

  3. are divided depending on said order strength

    1. x86 and the JMM are mostly partial orders

  4. exist at the CPU microarchitecture level or at the language level (MM or JMM)

Questions?


Java

Multithreading stack

Fork Join framework

  1. ForkJoin itself, think map-reduce in Java

  2. memory-intensive, work-stealing, own thread pool

  3. Arrays.parallelSort > Arrays.sort

  4. parallelStream() > stream()
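A minimal sketch of points 3 and 4 (data and sizes are mine): both calls fan the work out over the common ForkJoin pool.

import java.util.Arrays;
import java.util.Random;
import java.util.stream.LongStream;

class ParallelDemo {
    public static void main(String[] args) {
        // parallel(): splits the range across the common ForkJoinPool
        long sum = LongStream.rangeClosed(1, 10_000_000).parallel().sum();
        System.out.println(sum);

        // Arrays.parallelSort: fork/join sort, pays off for large arrays
        long[] data = new Random(42).longs(5_000_000).toArray();
        Arrays.parallelSort(data);
        System.out.println(data[0]);
    }
}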

JCIP annotations

<dependency>
    <groupId>net.jcip</groupId>
    <artifactId>jcip-annotations</artifactId>
    <version>1.0</version>
</dependency>
Class, field, and method level annotations for describing thread-safety policies.
— package Javadoc

Class-level ones

  1. @Immutable: said class is immutable, implies @ThreadSafe

  2. @ThreadSafe: no extra synchro required, no matter what access sequence or thread interleaving

  3. @NotThreadSafe: as name states, optional, offered for extra clarity

Locking - guarding

  1. @GuardedBy("this") - intrinsic lock, synchronized case

  2. @GuardedBy("Some.class") - case for static synchronized

  3. More examples
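A minimal sketch of the intrinsic-lock case (class name is mine, assuming the jcip-annotations dependency above): @GuardedBy documents that every access to the field must hold this.

import net.jcip.annotations.GuardedBy;
import net.jcip.annotations.ThreadSafe;

@ThreadSafe
class GuardedCounter {
    @GuardedBy("this")              // intrinsic lock - the synchronized case
    private long value = 0;

    public synchronized void increment() { value++; }

    public synchronized long current() { return value; }
}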

ReadWriteLock

  1. separates reads from writes

  2. default impl is also reentrant

  3. has a pair of locks,

  4. the read lock can be held by multiple readers as long as the write lock isn’t held

  5. the write lock is exclusive and can be downgraded to a read lock
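A minimal sketch with the default implementation, ReentrantReadWriteLock (class and method names are mine): many readers proceed concurrently, the writer is exclusive.

import java.util.concurrent.locks.ReentrantReadWriteLock;

class CachedValue {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String value = "initial";

    public String read() {
        lock.readLock().lock();      // many threads may hold the read lock at once
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    public void write(String newValue) {
        lock.writeLock().lock();     // exclusive: blocks readers and other writers
        try {
            value = newValue;
        } finally {
            lock.writeLock().unlock();
        }
    }
}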

Useful API

  1. ThreadLocalRandom > Random

  2. CompletableFuture - FP and async and other examples

  3. FutureTask - bridge between Runnable and Callable

  4. InheritableThreadLocal - its value is inherited by child threads

  5. TransferQueue or SynchronousQueue
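A small sketch of the first two points (values and names are mine): ThreadLocalRandom avoids contention on a shared seed, CompletableFuture chains async work in an FP style.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

class UsefulApiDemo {
    public static void main(String[] args) {
        // per-thread generator, no shared state to fight over
        int roll = ThreadLocalRandom.current().nextInt(1, 7);

        // async pipeline: supply a value on another thread, then transform it
        CompletableFuture<String> result = CompletableFuture
                .supplyAsync(() -> "rolled " + roll)
                .thenApply(String::toUpperCase);

        System.out.println(result.join());
    }
}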

Testing

exhaustive, repeated testing, stress testing, model checking, benchmarking

Solutions

  1. P#0: separate reads from writes, e.g. ReentrantReadWriteLock

  2. P#1: immutable classes, local variables, ThreadLocal<YourObject>

  3. P#2: as above + synchronization and atomicity (so JMM keywords)

  4. P#3: padding, @Contended or Blackhole from JMH

  5. P#4: memory model

JMM

JMM, Java Memory Model

a set of rules determining the happens-before relationship and the order of execution. Described in JLS 17.4; uses the final, volatile and synchronized keywords.

Java Memory Model

A memory model describes, given a program and an execution trace of that program, whether the execution trace is a legal execution of the program. The Java programming language memory model works by examining each read in an execution trace and checking that the write observed by that read is valid according to certain rules.
— JLS 17.4

JMM elements

  1. program order, synchronization order

  2. happens-before, synchronizes-with

  3. hardware-supported via memory barriers
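A minimal sketch of the happens-before edge created by a volatile write/read pair (names are mine): once the reader sees ready == true, the JMM guarantees it also sees payload == 42.

class HappensBeforeDemo {
    static int payload = 0;
    static volatile boolean ready = false;   // the volatile write/read creates the ordering edge

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            payload = 42;   // ordinary write...
            ready = true;   // ...made visible by the subsequent volatile write
        });
        Thread reader = new Thread(() -> {
            while (!ready) { }               // volatile read
            System.out.println(payload);     // guaranteed to print 42
        });
        writer.start();
        reader.start();
    }
}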

Java Object Layout

  1. field layout, header & alignment info

  2. externals - what is reachable from our object

  3. footprint estimate

  4. works with different VM modes (compressed OOPs, 32/64b)
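A minimal sketch, assuming the org.openjdk.jol:jol-core dependency (class names are mine):

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

class JolDemo {
    static class Pair { int a; long b; }

    public static void main(String[] args) {
        // field layout, header and alignment of the class
        System.out.println(ClassLayout.parseClass(Pair.class).toPrintable());
        // externals and footprint estimate for a concrete instance
        System.out.println(GraphLayout.parseInstance(new Pair()).toFootprint());
    }
}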

Java Microbenchmark Harness

The benchmarking standard and a great testing tool
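A minimal benchmark sketch, assuming the org.openjdk.jmh:jmh-core dependency plus the jmh-generator-annprocess annotation processor (class and method names are mine); run it via the JMH Maven archetype or the org.openjdk.jmh.runner.Runner API.

import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class SumBenchmark {
    long a, b;

    @Setup
    public void fill() {
        a = ThreadLocalRandom.current().nextLong();
        b = ThreadLocalRandom.current().nextLong();
    }

    @Benchmark
    public long sum() {
        return a + b;   // return the value so dead-code elimination cannot remove the work
    }
}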

Questions?


Summarizing

  1. Mechanical sympathy - know and use your HW, be eco-friendly and performant

  2. multithreading may be a silver bullet sometimes (Amdahl and the USL say when)

  3. Avoid or minimize conflicting accesses / data races

    1. less heap, more stack

    2. no shared state > share immutables > share carefully synchronized / atomic state

  4. JCIP annotations, juc, actors, JOL, JMH, lstopo (hwloc)

Multithreading for humans

  1. multithreading is the norm now, so such bugs will bite you too

  2. high-level APIs require understanding to be used properly

  3. same with multithreading design patterns, including those used in JDK

  4. performance

Absolute minimum 4 everyday programmer

  1. Amdahl’s and USL’s math

  2. Local variables rulez

  3. Immutable classes

  4. JCIP annotations

  5. ForkJoin and parallel streams

Further reading

  1. JMM: JLS 17.4,

  2. JCIP

  3. Oracle Trail on Concurrency is easy for beginners and ain’t deep at all

  4. Aleksey Shipilev - JMM Pragmatics

  5. any related papers by Leslie Lamport, Sarita Adve or Hans Boehm

Questions?
