Chosen links

Links - 10th March 2019

Identity by any other name

The fascinating thing about identifiers is that while they identify the same “thing” over time, that referenced thing may slide around in its meaning. Product descriptions, reviews, and inventory balance all change, while the product ID does not. Reservations, orders, and bookings all have identifiers that don’t change, while the stuff they identify may subtly change over time.

Identity and identifiers provide the immutable linkage. Both sides of this linkage may change, but they provide a semantic consistency needed by the business operation. No matter what you call it, identity is the glue that makes things stick and lubricates cooperative work.

For a long time, we worked behind the façade of a single centralized database. If you wanted to talk to other computers, that was an “application problem” and not in the purview of the system. Data lived as values in cells in the relational database. Everything could be explained in simple abstractions, and life was good!

Then, we started splitting up centralized systems for scale and manageability. We also tried to get different systems that had been independently developed to work together. That created many challenges in understanding each other4 and ensuring predictable outcomes, especially for atomic transactions.

As time moved on, a number of usage patterns emerged that address the challenges of work across both homogeneous and heterogeneous boundaries. All of those patterns depend on connecting things with notions of identity. The identities involved frequently remain firm and intact over long periods of time.

When your partner business uses identifiers in a non-immutable way, you need to be on your toes.

When your partner business uses identifiers in a non-immutable way, you need to be on your toes.

The identity captured in the URL is a large part of why REST is so powerful. The underlying resource has an identity in the URL namespace. Each representation (assigned to a single user) has an identity in the URL namespace. Specific operations are captured, leveraging identity within the URL namespace — a powerful mechanism using the identity of the URL!

Identities must be scoped in space and time so that they don’t cause ambiguities. This is, on the one hand, an obvious and silly thing to say. On the other hand, it is a liberating concept.

Identifiers may have permanent unique IDs like those offered by UUIDs. These are powerful and useful. Identifiers may have a centralized or hierarchical authority that assigns their IDs, and that, by itself, offers challenges: Does this authority scale? Is it broad enough in its role to encompass the many different pieces of the solution?

The reality for most systems is that identities span the participants that see the use of that specific identifier. When merchants interact with a big ecommerce site, they will have shared identifiers for their cooperative work. Still, the merchants may not share the identifiers they use to deal with private suppliers. Those private suppliers may have different identifiers used to interact with the manufacturers of their products.

The scope of the identifiers is typically subject to the portion of the workflow that hosts the identifier. There are global IDs like UPC or SSN (Social Security number), but there are also local IDs like SKUs that are defined only for a single merchant.

Data-centric programming for distributed systems

Modern data-intensive systems are increasingly distributed across large collections of machines, both to store and process the rapidly-increasing data volumes now produced by even modest-sized enterprises and to satisfy the growing hunger of analysts for “big data”. Distributed systems are difficult to program and reason about because of two fundamental sources of uncertainty in their executions and outcomes. First, due to asynchronous communication, nondeterminism in the ordering and timing of messages delivery can “leak” into program outcomes, leading to data inconsistencies. Second, due to partial failure of components and communication attempts, programs may compute incomplete outcomes or corrupt persistent state. Traditional solutions (e.g., distributed transactions) that hide these complexities from the programmer are considered by many to be untenable at scale, and are increasingly replaced with ad-hoc solutions that trade correctness guarantees for acceptable and predictable performance.

The challenges of programming distributed systems are exacerbated by the fact that they are no longer the sole domain of experts. The relatively recent accessibility of large-scale computingresources (e.g., the public cloud), and proliferation of reusable data management components (e.g., “NoSQL” stores, data processing frameworks, caches and message queues) have created a crisis: all programmers must learn to be distributed programmers. Few tools exist to assist application programmers, data analysts and mobile developers to struggle with these tradeoffs

We hypothesize that many of the challenges of programming distributed systems arise from the mismatch between the sequential model of computation adopted by most programming languages — in which programs are specified as an ordered list of operations to perform on an ordered array of memory cells — and the inherently disorderly nature of distributed systems, in which no total order of events exists. Nondeterminism in message delivery order can cause distributed programs to produce nondeterministic results, which complicates testing and debugging. For programs that replicate state, this nondeterminism can lead to replica divergence and other consistency anomalies. Exerting control over delivery order to ensure that sequential programs are executed in lock-step can incur unacceptable performance penalties in a distributed execution.

At the other end of the spectrum, purely “declarative” languages like SQL — which shift the programmer’s focus from computation to data — are set-based and lack the ability to even express ordered operations. In exchange for this limited expressivity, statements in such languages can besafely evaluated in a data-parallel, coordination-free manner.

Disorderly programming — a theme that we explore in this thesis through language design — extends the declarative programming paradigm with a minimal set of ordering constructs. Instead of overspecifying order and then exploring ways to relax it (e.g., by using threads and synchronization primitives), a disorderly programming language encourages programmers to underspecify order, to make it easy (and natural) to express safe and scalable computations. As we will show, disorderly programming also enables powerful analysis techniques that recognize when additional ordering constraints are required to produce correct results. Mechanisms ensuring these constraintscan then be expressed and inserted at an appropriately coarse grain to achieve the needs of coretasks like mutable state and distributed coordination.

The RSA 2019 Conference

Sadly, the lack of foundations for the people at most of the booths mirrored the lack of a solid foundation for the products. There are some good, useful products and services present on the market. But the vast majority are intended to apply bandaids (or another layer of virtualization) on top of broken software and hardware that was never adequately designed for security. Each time one of those bandaids fails, another company springs up to slap another on over the top. That then leads to acquisition and integration into security suites. No one is really attacking the poor underlying assumptions and broken architectures. This is related to why I don’t submit proposals to talk at the conference — I tried a few years ago and the message conveyed to me was that it was out of step with what the sponsors wanted presented. The industry is primarily based on selling the illusion that vendors' products can — in the right combination and with enough money spent — completely protect target systems. Someone pointing out that this is fundamentally flawed is not a welcome addition. I get that a lot — it is probably why I don’t get asked to be a company advisor, either.

Single-file cross-platform C/C++ headers implementing self-contained libraries

Besides, descriptive variable names violate the principle of “don’t repeat yourself”. It is better to describe the variable only once, in a comment next to its declaration, instead of several times