In the last few years, I’ve had the pleasure of kicking off lots of new reporting relationships with both engineers and engineering managers. Over time, I’ve learned that getting some particular data during an initial 1:1 can be really helpful, as I can refer back to the answers as I need to give a person feedback, recognize them, and find creative ways to support them. Most of these I’ve stolen from some really amazing Etsy coworkers.
At The New York Times we have a number of different systems that are used for producing content. We have several Content Management Systems, and we use third-party data and wire stories. Furthermore, given 161 years of journalism and 21 years of publishing content online, we have huge archives of content that still need to be available online, that need to be searchable, and that generally need to be available to different services and applications.
These are all sources of what we call published content. This is content that has been written, edited, and that is considered ready for public consumption.
On the other side we have a wide range of services and applications that need access to this published content — there are search engines, personalization services, feed generators, as well as all the different front-end applications, like the website and the native apps. Whenever an asset is published, it should be made available to all these systems with very low latency — this is news, after all — and without data loss.
This article describes a new approach we developed to solving this problem, based on a log-based architecture powered by Apache KafkaTM. We call it the Publishing Pipeline. The focus of the article will be on back-end systems. Specifically, we will cover how Kafka is used for storing all the articles ever published by The New York Times, and how Kafka and the Streams API is used to feed published content in real-time to the various applications and systems that make it available to our readers. The new architecture is summarized in the diagram below, and we will deep-dive into the architecture in the remainder of this article.
As researchers in language design and related topics, we face a recurring question: how much ought we to idealise the rocks and waves inhabited by practitioners? We can idealise not only how gadgets work (machines, software), but also why particular trade-offs win out in particular cases — including why people use C. There is much valuable work to be done in idealised terms, on designs, or in clean-slate fashion, or atop theoretical rather than practical substrates. But the balloonist’s view does not seem to explain the endurance of C, given its attendant problems; the sea-farers seem to live in some other, incommensurable reality. To understand this phenomenon, we will have to think differently. Over the remainder of the essay, I’ll argue the following points (each in its own section), sharing the theme that we need to think less about languages as discrete abstractions, less hierarchically in general, and more about the systems which embody languages — seeing language implementations as parts of those systems, and not shying away from contextual details associated with implementation concerns and non-portability. Despite this prevailing preference for talking and thinking about languages in a discrete sense, I will also note certain ways in which we have the habit of confusing C’s implementations with the language itself.
Rewritable software is a term coined by Jonathan Rees for software that is hard to write but then easy to rewrite.
The software is hard to write in that you spend years patiently writing code, experimenting with complicated ideas, and exploring the problem space. You wrestle with intricate problems, many of them dead-ends, and pull heroic all-night debugging sessions. Gradually though you discover the essense of the problem you are solving and you eliminate the accidental complexity.
The software then is really simple and easy to understand. The complex ideas are conspicuous in their absence. People can read the code, understand it, and write it again themselves. “Is that all there is to it?”
The classic example is perhaps John McCarthy’s Lisp interpreter written in Lisp. Jonathan Rees and Richard Kelsey also wrote Scheme48 with this explicit goal: “the name derives from our desire to have an implementation that is simple and lucid enough that it looks as if it were written in just 48 hours”.
Snabb Switch aspires to be rewritable software too. We are wrestling with all kinds of complexity: writing device drivers, bypassing operating systems, mapping memories of virtual machines, exploring obscure features of the latest CPUs, and interoperating with many and varied pieces of black-box network equipment.
If we do our job well then after years of intensive development people will be able to read the code and think, “Is that all? I could rewrite that in a weekend”.