Why query planning for streaming systems is hard
Why can’t we just use the same techniques for streaming queries as we do for batch queries?
Queries are long-lived. The distribution of the data might change over time, and we might not even have any data when we first install the query. A plan that was optimal yesterday might be a disaster today.
Queries are stateful. A faster plan might also require more memory. Switching to a different plan while running can incur expensive work to build the state required for the new plan. In the worst case, if the historical input is no longer available, we might not be able to compute the new state at all.
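To make that concrete, here is a minimal sketch of what a plan switch entails, assuming a hypothetical streaming hash join (the state layout and replay interface are made up for illustration, not any real engine's API): the new plan's state can only be built by replaying whatever history is still retained.

```python
from collections import defaultdict

class HashJoinState:
    """Per-key state for a hypothetical streaming hash join."""
    def __init__(self):
        self.left = defaultdict(list)
        self.right = defaultdict(list)

def rebuild_state(history):
    """Build the new plan's state by replaying retained input.

    If the history has already been discarded, the plan switch is
    simply impossible -- the worst case described above.
    """
    if history is None:
        raise RuntimeError("historical input discarded; cannot build new state")
    state = HashJoinState()
    for side, key, row in history:
        (state.left if side == "L" else state.right)[key].append(row)
    return state

history = [("L", 42, "left row"), ("R", 42, "right row")]  # retained input
state = rebuild_state(history)  # expensive: cost grows with history size
```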
Optimization has multiple objectives. In the classic database model the only thing we are optimizing for is total runtime. In a streaming system we can trade off between throughput, latency, memory usage, bandwidth, installation time, failure tolerance, etc. In elastic systems this is even more complicated because our resources aren't fixed: maybe the faster plan also costs more money.
There are multiple queries. Queries can contend with each other for resources, e.g. a shuffle-heavy plan might cause delays for other queries on the same network. At the same time, we can also consider sharing resources like indexes or even entire sub-graphs between queries. To take best advantage of this we need to optimize across multiple queries simultaneously. It might sometimes make sense to choose plans that are individually suboptimal if that allows more sharing between them. But this means that the cost estimate is no longer linear: the total cost of two plans is not just the sum of the costs of each plan, which makes the search problem much harder.
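A toy cost model (with made-up operator costs) shows the non-linearity: once two plans can share a scan and an index, the cost of running both is less than the sum of their individual costs.

```python
def cost(plan):
    """Toy cost model: a plan is a set of operators; cost is their sum."""
    op_cost = {"scan": 10, "index_build": 50, "filter_a": 5, "filter_b": 5}
    return sum(op_cost[op] for op in plan)

q1 = {"scan", "index_build", "filter_a"}   # plan for query 1
q2 = {"scan", "index_build", "filter_b"}   # plan for query 2

print(cost(q1) + cost(q2))  # 130: costing each query in isolation
print(cost(q1 | q2))        # 70: the shared scan and index are built once
```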
Data statistics are more complicated. It's not enough to know the distribution of the data at any one time; we also need to know how it evolves over time. For example, consider a table that grows continuously versus a table that stays small but has a high rate of change: they may see the same number of incoming updates, but we should plan very differently for these cases.
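A back-of-envelope simulation of those two cases (the rates are invented): both inputs see the same update rate, but the state they force us to keep is wildly different.

```python
def simulate(updates_per_tick, delete_fraction, ticks):
    """Track net table size given a fixed update rate, where each update
    is an insert or (with the given fraction) a delete."""
    size, sizes = 0, []
    for _ in range(ticks):
        deletes = int(updates_per_tick * delete_fraction)
        inserts = updates_per_tick - deletes
        size += inserts - min(deletes, size + inserts)  # can't go below zero
        sizes.append(size)
    return sizes

growing  = simulate(updates_per_tick=100, delete_fraction=0.0, ticks=10)
churning = simulate(updates_per_tick=100, delete_fraction=0.5, ticks=10)

# Same update rate (100 per tick), very different state:
print(growing[-1])   # 1000 rows and climbing: plan for an ever-growing table
print(churning[-1])  # stays near zero net rows: plan for a small but hot table
```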
Plans may get very deep. It's increasingly common to stack views on views on views. Unfortunately, combining cost estimates exactly is currently intractable, so we rely on simplifying assumptions (e.g. no correlation between different predicates) that introduce some degree of error. The deeper the plans get, the more these errors multiply.
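As a rough illustration, suppose each level of the plan mis-estimates selectivity by some factor q (the 2x here is an arbitrary choice); the combined error grows roughly like q to the power of the depth:

```python
q = 2.0  # assume each level's estimate is off by 2x (illustrative number)
for depth in (1, 2, 4, 8):
    print(f"depth {depth}: error factor {q ** depth:.0f}")
# depth 1: 2, depth 2: 4, depth 4: 16, depth 8: 256 -- errors multiply
```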
The solution to every intellectual problem lies between two opposites: solving “with understanding” and solving “with experience”. Most of the time we use a mix of both, though one can imagine “pure understanding, no experience” and “pure experience, no understanding” ways of solving a problem.
When implemented in software, understanding-based (analytical) solutions require a developer to understand the domain fully and reflect it in the code. Experience-based solutions, by contrast, can be created by fitting a few heuristics that match the most important cases and then iteratively changing those heuristics (or adding new ones) as new cases are uncovered. This iterative fitting can be performed by the developer or by another algorithm; the latter we now call “machine learning” (a criminal oversimplification, but enough for the current context).
There is one common confusion around the word “model” in the ML context: an ML “model” is nothing like a “model” of the problem domain in the analytical sense. I'd really prefer we call those “experiences”; it would be much more appropriate (“download these experiences of image recognition trained on that dataset, or gather custom experiences from a different dataset by running this training script”), don't you think?
Experience works better than understanding in two base cases (which can be seen as one and the same): (i) the full understanding is too complex or unclear for a human to model in code (image recognition), or (ii) there is no underlying “analytical model” in the target domain, just an endless list of edge cases (say, bank scoring).
The Rise and Fall of the OLAP Cube
One of the biggest shifts in data analytics over the past decade is the move away from building “data cubes”, or “OLAP cubes”, to running OLAP[1] workloads directly on columnar databases.
The decline of the OLAP cube is a huge change, especially if you’ve built your career in data analytics over the past three decades. It may seem bizarre to you that OLAP cubes — which were so dominant over the past 50 years of business intelligence — are going away. And you might be rightly skeptical of this shift to columnar databases. What are the tradeoffs? What are the costs? Is this move really as good as all the new vendors say that it is? And of course, there’s that voice at the back of your head, asking: is this just another fad that will go away, like the NoSQL movement before it? Will it even last?
This essay seeks to be an exhaustive resource on the history and development of the OLAP cube, and the current shift away from it. We’ll start with definitions of the terminology (OLAP vs OLTP), cover the emergence of the OLAP cube, and then explore the emergence of columnar data warehouses as an alternative approach to OLAP workloads.
“"So what is OLAP?” you might ask. The easiest way to explain this is to describe the two types of business application usage. Let’s say that you run a car dealership. There are two kinds of database-backed operations that you need to do:
You need to use a database as part of some business process. For instance, your salesperson sells the latest Honda Civic to a customer, and you need to record this transaction in a business application. You do this for operational reasons: you need a way to keep track of the deal, you need a way to contact the customer when the car loan or insurance is finally approved, and you need it to calculate sales bonuses for your salesperson at the end of the month.
You use a database as part of analysis. Periodically, you will need to collate numbers to understand how your overall business is doing. In his 1993 paper, Codd called this activity “decision-making-support”. These queries are things like “how many Honda Civics were sold in London in the last 3 months?” and “who are the most productive salespeople?” and “are sedans or SUVs selling better overall?” These are questions you ask at the end of a month or a quarter to guide your business planning for the near future.

The first category of database usage is known as “Online Transaction Processing”, or “OLTP”. The second category of database usage is known as “Online Analytical Processing”, or “OLAP”.
Or, as I like to think of it:
OLTP: using a database to run your business
OLAP: using a database to understand your business
Why do we treat these two categories differently? As it turns out, the two usage types have vastly different data-access patterns.
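As a toy sketch of those patterns (with a hypothetical schema and made-up rows): OLTP touches a few whole rows by key, while OLAP scans a couple of columns across every row.

```python
# Hypothetical car-dealership table, shrunk to a few rows for illustration.
sales = [
    {"id": 1, "model": "Honda Civic", "city": "London", "price": 24000},
    {"id": 2, "model": "Honda CR-V",  "city": "Leeds",  "price": 31000},
    # ...imagine millions more rows
]

# OLTP: record one deal, then fetch it back by key (a few whole rows).
sales.append({"id": 3, "model": "Honda Civic", "city": "London", "price": 23500})
deal = next(row for row in sales if row["id"] == 3)

# OLAP: aggregate a couple of columns across every row (scan-heavy).
civics_in_london = sum(
    1 for row in sales
    if row["model"] == "Honda Civic" and row["city"] == "London"
)
print(deal["price"], civics_in_london)
```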
Has UML died without anyone noticing?
Whilst some people employ lightweight modeling techniques such as C4, most diagrams in use today are what I call, somewhat derogatorily, masala diagrams. No hard feelings: I draw my own diagrams like this too. Why masala? Because they are informal; they cover multiple dimensions at once; they may be both structural and behavioural, logical and physical. They are often a mishmash of the views in the 4+1 architectural model.
Multi-million-dollar systems, upon which your life and finances depend, are devised, funded and executed entirely on the back of said masala diagrams, often with no more additional collateral than a bunch of epics and user stories.
Ernie, surely my bank’s mortgage system hasn’t been architected using one of these dreadful masala models that you describe.
That can’t be right, can it? Yes, it can: if your bank isn’t running CICS and the solution was sold last year, it is very likely that it was entirely conceived using the very same kind of masala diagrams that I am talking about.
Has the world gone mad? No, it’s just that we have given up on the engineering side of software; it is purely a coding affair for now. I am not saying that those who write software aren’t engineers themselves; they largely are. The point is that, at the organizational level, software is no longer being engineered with the equivalent processes and artifacts found in disciplines such as mechanical engineering. Boeing would never order a jet engine from Rolls-Royce on the back of a set of informal masala diagrams.
Masala diagrams have a role though. If you take them for what they are, they can be beautiful things. See, they aren’t specs. Their purpose is to evoke emotions. Masala diagrams are valuable when they spark joy in the heart of every stakeholder the diagram is intended for, as per Marie Kondo’s principle.
A primer on engineering delivery metrics
Unfortunately, metrics have been used to judge individual performance in ways that hurt employees regardless of their actual performance, such as counting lines of code. These poor management practices have eroded the trust between management and their teams, and it’s normal for engineers to approach metrics with skepticism.
To successfully convince your team to adopt delivery metrics, you must have a clear purpose for the metrics and solid reasoning for the outcomes you seek. Some questions you should be able to answer for your team members about why you are measuring the delivery process:
Why do we need metrics?
What are we going to measure?
Who will have access to these metrics?
Why did we pick these metrics, and which others could we have chosen?
What is our plan to move these metrics forward?
How will these metrics impact individuals?
Engineering managers should hold their own problem-solving abilities and reasoning to the same standards they expect of the engineers in their organization.