Archiloque weekly

Links - 11th August 2019

Run your own social: How to run a small social network site for your friends

The main reason to run a small social network site is that you can create an online environment tailored to the needs of your community in a way that a big corporation like Facebook or Twitter never could. Yes, you can always start a Facebook Group for your community and moderate that how you like, but only within certain bounds set by Facebook. If you (or your community) run the whole site, then you are ultimately the boss of what goes on. It is harder work than letting Facebook or Twitter or Slack or Basecamp or whoever else take care of everything, but I believe it’s worth it.

Let’s go back to Friend Camp. While there are a hundred thousand people we can talk to from Friend Camp, there are only about 50 people with an active Friend Camp login. We call ourselves “campers” because we are corny like that. And campers have a special communication channel that lets us post messages that only other campers can see.

If I make software that makes the lives of 50 people much nicer, and it makes 0 people more miserable, then on the balance I think I’m doing better than a lot of programmers in the world. Because we’re mostly all friends with each other, this extra communications mode is kind of like a group chat on steroids. For our community it ends up being a sort of hybrid between Twitter and a group chat. As a result of having a community layer alongside a more public layer, we have a movie night, a book club, and a postcard club. Campers visit each other when we travel, even if we’ve never met in person before. We correspond with each other about what we’re making for dinner and trade recipes. They’re the kind of mundane interactions that you probably don’t want to have with perfect strangers but you cherish in a group of people you care about.

We are also able to have moderation rules that are hyper-specific to our own values as a community. It lets us maintain an environment that’s far more pleasant than you find on most social media sites.

“What The Hardware Does” is not What Your Program Does: Uninitialized Memory

Maybe the most important lesson to take away from this post is that “what the hardware does” is most of the time irrelevant when discussing what a Rust/C/C++ program does, unless you already established that there is no undefined behavior. Sure, hardware (well, most hardware) does not have a notion of “uninitialized memory”. But the Rust program you wrote does not run on your hardware. It runs on the Rust abstract machine, and that machine (which only exists in our minds) does have a notion of “uninitialized memory”. The real, physical hardware that we end up running the compiled program on is a very efficient but imprecise implementation of this abstract machine, and all the rules that Rust has for undefined behavior work together to make sure that this imprecision is not visible for well-behaved (UB-free) programs. But for programs that do have UB, this “illusion” breaks down, and anything is possible.

Only UB-free programs can be made sense of by looking at their assembly, but whether a program has UB is impossible to tell on that level. For that, you need to think in terms of the abstract machine.[1]

This does not just apply to uninitialized memory: for example, in x86 assembly, there is no difference between “relaxed” and “release”/“acquire”-style atomic memory accesses. But when writing Rust programs, even when writing Rust programs that you only intend to compile to x86, “what the hardware does” just does not matter if your program has UB. The Rust abstract machine does make a distinction between “relaxed” and “release”/“acquire”, and your program will go wrong if you ignore that fact. After all, x86 does not have “uninitialized bytes” either, and still our example program above went wrong.
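As a rough illustration of the “relaxed” versus “release”/“acquire” distinction, here is a minimal Rust sketch. The function name `publish_and_read` and the two-variable setup are assumptions made for this example and do not come from the linked post. Both variables are atomics so that the program stays free of data races. The point is only that the ordering arguments carry meaning for the abstract machine even where the generated x86 instructions may look the same.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: one thread publishes `value`, the caller waits for
/// the `ready` flag and returns what it then reads from `data`.
fn publish_and_read(value: u32) -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let writer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(value, Ordering::Relaxed);
            // Release: writes made before this store become visible to any
            // thread that observes `true` through an Acquire load of `ready`.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire pairs with the Release store above. If both sides used
    // Relaxed instead, the abstract machine would allow `data` to still
    // read as 0 here, whatever a particular CPU happens to do.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);
    writer.join().unwrap();
    seen
}

fn main() {
    println!("reader observed {}", publish_and_read(42));
}
```

With a plain (non-atomic) `data` and a relaxed flag, the same structure would be a data race, which is undefined behavior rather than merely a stale read.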

Of course, to explain why the abstract machine is defined the way it is, we have to look at optimizations and hardware-level concerns. But without an abstract machine, it is very hard to ensure that all the optimizations a compiler performs are consistent — in fact, both LLVM and GCC suffer from miscompilations caused by combining optimizations that all seem fine in isolation, but together cause incorrect code generation. The abstract machine is needed as an ultimate arbiter that shows if all of the optimizations are correct, or if some of them are in conflict with each other. I also think that when writing unsafe code, it is much easier to keep in your head a fixed abstract machine as opposed to a set of optimizations that might change any time, and might or might not be applied in any order.

Unfortunately, in my opinion not enough of the discussion around undefined behavior in Rust/C/C++ is focused on what concretely the “abstract machine” of these languages looks like. Instead, people often talk about hardware behavior and how that can be altered by a set of allowed optimizations, but the optimizations performed by compilers change as new tricks are discovered, and it’s the abstract machines that define if these tricks are allowed. C/C++ have extensive standards that describe many cases of undefined behavior in great detail, but nowhere does it say that memory of the C/C++ abstract machine stores Option<u8> instead of the u8 one might naively expect. In Rust, I am very happy that we have Miri, which is meant to be (very close to) a direct implementation of the Rust abstract machine, and I am strongly advocating for us to write the Rust specification (once that gets going) in a style that makes this machine very explicit. I hope C/C++ will come around to do the same, and there is some great work in that direction, but only time will tell to what extent that can affect the standard itself.
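The “memory stores Option<u8>” idea can be sketched as a toy model in safe Rust. This is a deliberate simplification for illustration, not how Miri or any compiler is implemented. The names `AbstractByte`, `Outcome` and `check`, and the predicate `b < 150 || b > 120`, are assumptions chosen for this example.

```rust
/// A byte as a simplified abstract machine might track it:
/// `None` stands for "uninitialized", `Some(b)` for an initialized byte.
type AbstractByte = Option<u8>;

/// Result of evaluating an operation in the toy model.
#[derive(Debug, PartialEq)]
enum Outcome {
    Value(bool),
    UndefinedBehavior,
}

/// Evaluates a predicate that is true for every initialized `u8`.
/// Hardware has no "uninitialized" state, but the model does, and
/// using such a byte in a comparison is where the model reports UB.
fn check(x: AbstractByte) -> Outcome {
    match x {
        Some(b) => Outcome::Value(b < 150 || b > 120),
        None => Outcome::UndefinedBehavior,
    }
}

fn main() {
    // Every initialized byte satisfies the predicate.
    for b in 0..=u8::MAX {
        assert_eq!(check(Some(b)), Outcome::Value(true));
    }
    // An uninitialized byte has no value to compare at all.
    println!("{:?}", check(None));
}
```

The model makes the extra state explicit: there are 257 possibilities per byte, not 256, and only the abstract machine can see the 257th.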

Files are fraught with peril

In conclusion, computers don’t work (but you probably already know this if you’re here at Gary-conf). This talk happened to be about files, but there are many areas we could’ve looked into where we would’ve seen similar things.

One thing I’d like to note before we finish is that, IMO, the underlying problem isn’t technical. If you look at what huge tech companies do (companies like FB, Amazon, MS, Google, etc.), they often handle writes to disk pretty safely. They’ll make sure that they have disks where power loss protection actually works, they’ll have patches into the OS and/or other instrumentation to make sure that errors get reported correctly, there will be large distributed storage groups to make sure data is replicated safely, etc. We know how to make this stuff pretty reliable. It’s hard, and it takes a lot of time and effort, i.e., a lot of money, but it can be done.

If you ask someone who works on that kind of thing why they spend mind-boggling sums of money to ensure (or really, increase the probability of) correctness, you’ll often get an answer like “we have a zillion machines and if you do the math on the rate of data corruption, if we didn’t do all of this, we’d have data corruption every minute of every day. It would be totally untenable”. A huge tech company might have, what, order of ten million machines? The funny thing is, if you do the math for how many consumer machines there are out there and how much consumer software runs on unreliable disks, the math is similar. There are many more consumer machines; they’re typically operated at much lighter load, but there are enough of them that, if you own a widely used piece of desktop/laptop/workstation software, the math on data corruption is pretty similar. Without “extreme” protections, we should expect to see data corruption all the time.

But if we look at how consumer software works, it’s usually quite unsafe with respect to handling data. IMO, the key difference here is that when a huge tech company loses data, whether that’s data on who’s likely to click on which ads or user emails, the company pays the cost, directly or indirectly and the cost is large enough that it’s obviously correct to spend a lot of effort to avoid data loss. But when consumers have data corruption on their own machines, they’re mostly not sophisticated enough to know who’s at fault, so the company can avoid taking the brunt of the blame. If we have a global optimization function, the math is the same — of course we should put more effort into protecting data on consumer machines. But if we’re a company that’s locally optimizing for our own benefit, the math works out differently and maybe it’s not worth it to spend a lot of effort on avoiding data corruption.
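For a sense of what “more effort” can mean at the smallest scale, here is a sketch of one widely used pattern for replacing a file: write to a temporary file, flush it to disk, then rename it over the original. It is not taken from the talk, and it is far from a complete answer. The function name `write_atomically` and the `.tmp` naming are assumptions for this example, and the sketch still trusts the OS and the disk to report errors honestly, which is exactly the kind of assumption the talk questions.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: replaces `path` with `contents` so that a reader
/// should see either the old file or the new one, not a partial write.
fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Assumed naming scheme: same file name with a `.tmp` extension.
    let tmp = path.with_extension("tmp");

    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;
    // Ask the OS to push the data and metadata to the storage device.
    // Errors are propagated instead of being silently ignored.
    file.sync_all()?;
    drop(file);

    // Rename over the destination. A more careful version would also
    // sync the containing directory so the rename itself is durable.
    fs::rename(&tmp, path)?;
    Ok(())
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("atomic_write_demo.txt");
    write_atomically(&path, b"first version")?;
    write_atomically(&path, b"second version")?;
    println!("{}", String::from_utf8_lossy(&fs::read(&path)?));
    Ok(())
}
```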

Don’t ask if a monorepo is good for you – ask if you’re good enough for a monorepo

But I’m going to make an argument which I think mostly works regardless of the state of tooling on any given day:

  • Monorepo is great if you’re really good, but absolutely terrible if you’re not that good.

  • Multiple repos, on the other hand, are passable for everyone – they’re never great, but they’re never truly terrible, either.

Sum Types Are Coming: What You Should Know

Just as all mainstream languages now have lambda functions, I predict sum types are the next construct to spread outward from the typed functional programming community to the mainstream. Sum types are very useful, and after living with them in Haskell for a while, I miss them deeply when using languages without them.

Fortunately, sum types seem to be catching on: both Rust and Swift have them, and it sounds like TypeScript’s developers are at least open to the idea.

I am writing this article because, while sum types are conceptually simple, most programmers I know don’t have hands-on experience with them and don’t have a good sense of their usefulness.

In this article, I’ll explain what sum types are, how they’re typically represented, and why they’re useful. I will also dispel some common misconceptions that cause people to argue sum types aren’t necessary.
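For readers who have not used sum types, a small Rust sketch may help. Rust is one of the languages the excerpt mentions as having them, under the name `enum`. The `PaymentMethod` type and its variants are made up for this illustration and do not come from the article.

```rust
/// A value of this type is exactly one of the three variants, and each
/// variant carries only the data that makes sense for it.
enum PaymentMethod {
    Cash,
    Card { last_four: String },
    BankTransfer { iban: String },
}

/// The compiler checks that every variant is handled: adding a fourth
/// variant later makes this `match` fail to compile until it is updated.
fn describe(method: &PaymentMethod) -> String {
    match method {
        PaymentMethod::Cash => "cash".to_string(),
        PaymentMethod::Card { last_four } => format!("card ending in {}", last_four),
        PaymentMethod::BankTransfer { iban } => format!("bank transfer from {}", iban),
    }
}

fn main() {
    let methods = [
        PaymentMethod::Cash,
        PaymentMethod::Card { last_four: "4242".to_string() },
        PaymentMethod::BankTransfer { iban: "XX00 0000".to_string() },
    ];
    for m in &methods {
        println!("{}", describe(m));
    }
}
```

Without sum types, the usual substitute is a struct with a "kind" tag and several optional fields, where nothing stops a cash payment from carrying a card number.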

DevOps for the Database

What about the database? What is DevOps for the database? Just as with the big-picture definition of DevOps, there are a lot of different right ways to do it, it’s context-specific, but I have my particular viewpoint based on what I’ve seen work well and not-so-well. In this section I’ll do two dangerous things: I’ll introduce a list of capabilities I consider more or less important for DevOps in the database; and I’ll lay them out in a rough progression that you could use as a kind of maturity model if you wanted to. Why is my list, and my thematic progression, dangerous?

  • The list is dangerous because it’s my opinions formed anecdotally. If you want more reliable guidance, you need to rely on science, and there’s no research, data, theory, or science here. The only place I think you can find reliable DevOps science is in the book Accelerate (and other works by the same team). Nonetheless, I think it’s valuable to advance this list as a hypothesis because if I’m even partially right, the benefits are worthwhile, and my experience should count for something even though it’s not science.

  • The thematic progression is dangerous because it smells like a maturity model, and those are problematic, as the Accelerate authors detail on pages 6-8. In brief, maturity models fool you into seeing DevOps as a destination instead of a journey; they present the illusion of linear progress through atomic and well-bounded stages that have invariant definitions in different contexts; they encourage vanity metrics instead of being outcome-focused; and they are static instead of being dynamically defined in terms of an evolving understanding. However, I’m presenting a way to organize my list of DevOps practices because I believe there are benefits to doing so, and many people have requested my opinion about where to start and how to make progress.

Caveats aside, here’s how I articulate my current understanding of what separates teams who do DevOps for the database exceptionally well. I view the following capabilities as important:

  1. Automated database provisioning and configuration management (infra-as-code).

  2. Automated backups, recovery (restore), and continual, automatic backup testing.

  3. Schema-as-code in version control, with “normal” change and review processes.

  4. Migrations-as-code with version control, for both schema and data migrations.

  5. Continuous delivery of database schema migrations and data migrations.

  6. Forwards and backwards application/schema version compatibility (decoupling).

  7. Holistic and detailed (workload- and internals-focused) database monitoring.

  8. Developers have immediate visibility into queries in production.

  9. Developers own the full app lifecycle including production query performance.

  10. Developers own or participate in database performance incident response.

  11. Database-specific skill, knowledge, and processes are spread around the team.

  12. DBAs are shielded from most incidents caused by applications or customers.

  13. DBAs are focused on proactive, strategic architectural/platform/product initiatives.
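To make “migrations-as-code” (items 4 and 5) a little more concrete, here is a minimal sketch of the bookkeeping such tooling typically does: migrations live in the codebase with a version number, and a runner works out which ones have not yet been applied. It is an assumption-laden illustration, not the author's tooling and not any particular migration library. The `Migration` struct, the `pending` function and the SQL strings are all hypothetical, and no real database is involved.

```rust
use std::collections::BTreeSet;

/// A migration as it might be checked into version control.
struct Migration {
    version: u32,
    name: &'static str,
    sql: &'static str,
}

/// Returns the migrations not yet recorded as applied, in version order.
/// `applied` stands in for the bookkeeping table a real tool would keep
/// inside the database itself.
fn pending<'a>(all: &'a [Migration], applied: &BTreeSet<u32>) -> Vec<&'a Migration> {
    let mut todo: Vec<&Migration> = all
        .iter()
        .filter(|m| !applied.contains(&m.version))
        .collect();
    todo.sort_by_key(|m| m.version);
    todo
}

fn main() {
    let all = [
        Migration { version: 1, name: "create_users", sql: "CREATE TABLE users (id INT)" },
        Migration { version: 2, name: "add_email", sql: "ALTER TABLE users ADD email TEXT" },
        Migration { version: 3, name: "backfill_email", sql: "UPDATE users SET email = ''" },
    ];

    // Pretend version 1 has already been run in this environment.
    let mut applied = BTreeSet::new();
    applied.insert(1);

    for m in pending(&all, &applied) {
        // A real runner would execute `m.sql` in a transaction where the
        // database supports it, then record `m.version` as applied.
        println!("would apply {:03}_{}: {}", m.version, m.name, m.sql);
    }
}
```

Note that version 3 is a data migration rather than a schema change; the list treats both kinds as code that goes through the same review and delivery process.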

It’s relatively easy to adopt modern, cloud-native, DevOps practices for disposable parts of the infrastructure, such as web servers and application instances. These are designed to be replaceable and stateless. But databases are stateful, which makes them a lot harder to treat as “cattle, not pets”. There’s not only a lot more to get right, but the stakes are higher — data loss is worse than service unavailability — and it is impossible to make it instantaneous. Data has inertia, so you can’t just install a database and put it into production; you have to import data into it, which takes time. Likewise, you can’t always roll back database changes instantly if a deploy fails, because if state was mutated, it takes time to mutate the state back again.


1. This does imply that tools like valgrind, that work on the final assembly, can never reliably detect all UB.