<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Johann Schleier-Smith</title>
    <description>My story, my latest blog posts, selected publications, links, and other information.</description>
    <link>https://johann.schleier-smith.com/</link>
    <atom:link href="https://johann.schleier-smith.com/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Sat, 31 Jan 2026 12:20:55 -0800</pubDate>
    <lastBuildDate>Sat, 31 Jan 2026 12:20:55 -0800</lastBuildDate>
    <generator>Jekyll v4.3.3</generator>
    
      <item>
        <title>The myth and the reality of the 10x engineer</title>
        <description>&lt;p&gt;The notion of the “10x engineer” has become a fixture of Silicon Valley lore.
It stems from the common observation that there can be tremendous variation in how long it takes different people to complete a task, and that some people seem to get things done quickly on a consistent basis.&lt;/p&gt;

&lt;p&gt;Is there really a special gift or talent that a few people have that others don’t? Can 10x engineering be taught? Is there something special about software that makes 10x achievement possible in this field, as opposed to other fields such as aeronautical engineering?&lt;/p&gt;

&lt;h1 id=&quot;skills-and-performance&quot;&gt;Skills and performance&lt;/h1&gt;

&lt;p&gt;10x performance isn’t a myth, and it’s not that difficult to understand.
However, doing so requires a shift in mindset from a focus on the individual to a focus on the task. See Figure 1 below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/10x-eng/fig1.svg&quot; alt=&quot;Time to Complete Task vs Skill Level without Learning&quot; /&gt;
&lt;em&gt;Figure 1. For tasks well below someone’s skill level the time to complete the task (normalized to scope) is independent of skill required. As the required skill approaches the individual skill level, time to complete takes asymptotically longer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The key takeaway is that when the difficulty of a task bumps up against our level of skill, things take longer and longer.
If someone takes on a task that’s beyond their level of skill, they probably won’t be able to do it at all.&lt;/p&gt;
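&lt;p&gt;To make the shape of this relationship concrete, here is one possible functional form (my illustrative assumption; the figures don’t specify an equation) in which time is roughly constant for tasks well below one’s skill level and diverges as the required skill approaches it:&lt;/p&gt;

```python
def time_to_complete(required_skill, own_skill, base_time=1.0):
    """Illustrative curve in the spirit of Figure 1 (functional form assumed)."""
    ratio = required_skill / own_skill
    # A task at or beyond one's skill level is effectively out of reach.
    if min(ratio, 1.0) == 1.0:
        return float("inf")
    # Well below the limit the result stays close to base_time; near the
    # limit the denominator shrinks and time to complete blows up.
    return base_time / (1.0 - ratio)
```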

&lt;p&gt;I’ve oversimplified Figure 1 in various ways.
Skills can’t actually be measured along one dimension, and individual limits vary by type of skill.
Most glaringly, I ignore learning. Figure 2 addresses this last shortcoming.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/10x-eng/fig2.svg&quot; alt=&quot;Time to Complete Task vs Skill Level with Learning&quot; /&gt;
&lt;em&gt;Figure 2. People can learn to do tasks requiring skills beyond their initial level, but learning takes time, and proficiency is low initially.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once you bump up against your skill limit, you’re almost certainly going to accomplish more if you sit down and learn than if you try to muddle through.&lt;/p&gt;

&lt;p&gt;I’ve had the painful experience of tackling a task beyond my level of skill countless times.
For example, when I first tried to do data science in Python, I figured, “I know how to program, I’ll just dive into my project and look things up in the reference documentation as I go.”
Three days later I gave up in frustration and went back to a familiar technology.
On my second attempt I first worked through tutorials, making sure I really understood how to think about basic data structures and how to manipulate them.
With the benefit of this foundation, I started to make quick work of various tasks.&lt;/p&gt;

&lt;h1 id=&quot;when-skill-helps-and-when-it-doesnt&quot;&gt;When skill helps and when it doesn’t&lt;/h1&gt;

&lt;p&gt;One corollary of the skills-based model of performance is that having more skills doesn’t make you any better at mundane tasks.
For example, a senior engineer with a PhD in machine learning is probably going to need about the same amount of time to complete a data cleaning task as an entry-level analyst.
Having more skills doesn’t help when the task doesn’t require them.&lt;/p&gt;

&lt;p&gt;Google’s Jeff Dean is the most famous engineer I can think of.
Who else has reached internet meme status?
Many view him as a 100x engineer.
Jeff is best known for his public work, which includes the MapReduce paradigm that launched the Big Data movement and the TensorFlow technology that helped make AI mainstream.
What’s readily apparent about someone like Jeff is that he has been able to solve problems that the vast majority of engineers, even Google engineers, could &lt;em&gt;never&lt;/em&gt; solve.
His skills give him a tremendous advantage at figuring out how software should be written in a new application domain like AI, yet they would give him only a modest advantage at designing an ordinary enterprise business application.&lt;/p&gt;

&lt;h1 id=&quot;implications-and-advice&quot;&gt;Implications and advice&lt;/h1&gt;

&lt;p&gt;In my view, too many engineers today are operating along the asymptote of Figure 1.
Most of them are unaware of this, as are their managers.
My advice to improve this situation follows.&lt;/p&gt;

&lt;p&gt;Tips for engineers:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Assess honestly how your skills compare to those needed for the task.&lt;/li&gt;
  &lt;li&gt;Commit to learning any missing skills up-front. You don’t have to learn everything up-front, but identify what you will learn, make the commitment to doing so, and set aside the necessary time.&lt;/li&gt;
  &lt;li&gt;It’s good to push yourself to learn, but also push back on assignments when the skill gap is too great.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tips for managers:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Learn to recognize when someone is struggling because a task is above their level of skill. There’s no use in driving someone to deliver when they should either be reassigned or asked to take a step back and spend time learning.&lt;/li&gt;
  &lt;li&gt;Seek to hire quick learners. Fast learning is a wildcard skill that helps every time a task requires missing skills. How do you spot a quick learner? It’s easy: they’ve learned lots of stuff.&lt;/li&gt;
  &lt;li&gt;Pay people to learn relevant skills. Doing so will pay off.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can find various other advice on how to become a 10x engineer or how to hire one.
A common theme is the emphasis on focus, prioritization, and time management.
These are all contributing factors, and could be good for 2-3x.
In my view, however, skills are the longest lever, and one that is often overlooked.&lt;/p&gt;

&lt;h1 id=&quot;looking-ahead&quot;&gt;Looking ahead&lt;/h1&gt;

&lt;p&gt;I look forward to the day when 10x engineers are mostly a thing of the past.
We won’t talk about elusive 10x engineers anymore because most engineers will be tackling tasks that are well within their ability.
We will take for granted that the average engineer is 10x as productive as they were back in 2020.&lt;/p&gt;

&lt;p&gt;How will this happen?
The tips listed above can help by making everyone more aware of skill gaps, and by encouraging investment in filling them.
Better engineers and better managers are important, yet so is better technology.&lt;/p&gt;

&lt;p&gt;Imagine if we tried to write all of the software being written in 2020 using the tools available in 1990, a time when popular and high-productivity programming languages such as Python and Java had yet to be invented, and when many applications stored data in files rather than modern databases.
Productivity likely would be less than 10% of what it is today.&lt;/p&gt;

&lt;p&gt;One example of a technology that promises to vastly simplify certain engineering problems is &lt;em&gt;serverless computing&lt;/em&gt;, which hides the complex operating model of today’s cloud behind simplified programming abstractions.
Even experts appreciate the benefits of simplified programming because it frees up time and energy to focus on the most important aspects of a problem, usually delivering unique business value, rather than on undifferentiated technical details.
Simplified programming is also an equalizer and a lubricant, allowing more people to build software without first learning highly specialized skills, as well as to jump readily from one domain to another.&lt;/p&gt;

&lt;p&gt;There will still be star engineers in the future; however, I expect that most companies won’t need them.
The combination of better understanding of skill gaps, up-skilling to fill those gaps, and simplified programming models will mean that most programmers will be able to handle most tasks without reaching beyond the limits of their skills.&lt;/p&gt;
</description>
        <pubDate>Fri, 30 Oct 2020 00:00:00 -0700</pubDate>
        <link>https://johann.schleier-smith.com/blog/2020/10/30/10x-engineer-myth-and-reality.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2020/10/30/10x-engineer-myth-and-reality.html</guid>
        
        
      </item>
    
      <item>
        <title>Filling in a four-year gap</title>
        <description>&lt;p&gt;This space has been silent for almost four years. What happened?
Was I on a long vacation? Did I have nothing to share? Or was I just really busy? In the coming weeks I’m going to be adding some new posts, but before I do so I’d like to fill in the gap just a little bit.&lt;/p&gt;

&lt;p&gt;It has in fact been a busy time, both personally and professionally.
When I last posted my wife and I had no children, whereas now we are blessed to have two.
Those of you who have raised children know it’s both amazing and incredibly demanding.
Those of you who might do so in the future should mark it down as something to look forward to doing when you have a lot of extra time.&lt;/p&gt;

&lt;p&gt;Now for the professional highlights.
I earned my MS in Computer Science from UC Berkeley in 2016, submitting the 
&lt;a href=&quot;/blog/2016/11/20/restream-multi-versioned-parallel-streaming.html&quot;&gt;ReStream project&lt;/a&gt; as my thesis.
Berkeley is a wonderful community; I came there to learn, and I’m happy to say that I learned a lot.
Still, my thirst was not quenched.&lt;/p&gt;

&lt;p&gt;In 2017 I returned to the department, beginning a PhD that I’m still actively working on.
What motivated me then is a question that’s been bugging me ever since I faced &lt;a href=&quot;http://highscalability.com/blog/2011/8/8/tagged-architecture-scaling-to-100-million-users-1000-server.html&quot;&gt;the challenge of scaling Tagged&lt;/a&gt;, a social network that I co-founded.
I still couldn’t understand: &lt;em&gt;why is programming at scale so hard?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To most professional programmers it seems obvious.
When you need to operate at scale you need a lot of computers.
When you have a lot of computers you need to hook them all together, keep them all running well, and think about how they all should interact.&lt;/p&gt;

&lt;p&gt;Moreover, prominent Silicon Valley companies such as Google have for years advertised their scale to attract top technical talent.
As they tell it, scale not only multiplies the impact of your work, it also presents challenges that only the smartest engineers can solve.&lt;/p&gt;

&lt;p&gt;This point of view resonated with me—I had hired very bright people who wrote lots of clever code that was used by hundreds of millions of end users.
Still, I grappled with something fundamentally unsettling: the products that we were building just weren’t all that complicated.
I didn’t work on Twitter, but viewed it as a prominent example.
The original version was built in just a few weeks by a small team, but the company would employ hundreds just to ensure its ability to scale.&lt;/p&gt;

&lt;p&gt;I will reserve a fuller exploration of this puzzle for a future post. For now I merely mention that others have also questioned the conventional assumption that scale has to be hard.
Notably, &lt;em&gt;serverless computing&lt;/em&gt; has emerged as a layer on top of the traditional cloud that simplifies its programming and operating model.
Rather than working with virtual servers, programmers simply upload their code, hook it up to an API endpoint or other trigger, and pay for the time that it runs.
Launched in its contemporary form by Amazon Web Services in 2014, serverless has seen an explosion of interest in industry and academia.&lt;/p&gt;

&lt;p&gt;It took me several years to come to the realization that serverless computing offered the most direct answer to the problem I was focused on.
Programming at scale doesn’t have to be hard; it’s just hard when you need to deal with the complexity of lots of servers.&lt;/p&gt;

&lt;p&gt;In a nutshell, I’ve been working on understanding and improving serverless computing, even before I knew to call it that.
Serverless in its present form is practical only for certain limited applications, and figuring out how to bring this simplicity of programming at scale to a broader range of problems is a fascinating challenge.
I look forward to sharing some of the ideas that I’ve been working on.&lt;/p&gt;
</description>
        <pubDate>Mon, 31 Aug 2020 00:00:00 -0700</pubDate>
        <link>https://johann.schleier-smith.com/blog/2020/08/31/finding-my-way-to-serverless-computing.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2020/08/31/finding-my-way-to-serverless-computing.html</guid>
        
        
      </item>
    
      <item>
        <title>Multi-Versioned Parallel Streaming in ReStream</title>
        <description>&lt;p&gt;&lt;em&gt;This post summarizes work published in SoCC ‘16, &lt;a href=&quot;https://dl.acm.org/authorize?N27781&quot;&gt;ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing&lt;/a&gt; (also &lt;a href=&quot;/shared/102-Schleier-Smith.pdf&quot;&gt;here&lt;/a&gt;).&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;when-real-time-isnt-fast-enough&quot;&gt;When real-time isn’t fast enough&lt;/h1&gt;

&lt;p&gt;When building real-time data-driven applications, developers are continually challenged by a seemingly simple question, “what will the new software do when we deploy it?”
This is a challenge I encountered several times while building the if(we) social networks. When deploying a new anti-spam rule, for example, one wants to know how it performs given the logs of the past 30 days as input, but one does not want to wait for 30 days to have an answer. An hour-long replay job might be acceptable, but results in a few minutes would be much better.&lt;/p&gt;

&lt;p&gt;Similar needs for backtesting can be found just about anywhere people are developing real-time applications. I found it essential when creating real-time personalized recommendations; it was key to enabling an &lt;a href=&quot;/blog/2015/08/09/need-for-agile-machine-learning.html&quot;&gt;Agile practice in machine learning and data science&lt;/a&gt;.
It is a routine procedure in finance, and the need for it abounds in other areas including security, advertising, e-commerce, and emergent applications such as IoT.&lt;/p&gt;

&lt;p&gt;For developers doing backtesting, real-time replay is much too slow. A speedup of 1,000x is commonly desired, and even 10,000x can seem reasonable to ask for.
ReStream is a proof-of-concept system that shows how accelerated replay can be achieved.&lt;/p&gt;

&lt;h1 id=&quot;what-wont-work&quot;&gt;What won’t work&lt;/h1&gt;

&lt;p&gt;The traditional goal of stream processing systems has been low latency, minimizing the delay between the arrival of an event and the system’s response to it.
Backtesting and replay demand high throughput, minimizing overall job completion time across millions or billions of sequential events.
For workloads that partition naturally it can be simple to scale out a streaming system to achieve accelerated replay, but for many workloads no such partitioning exists.&lt;/p&gt;

&lt;p&gt;There are many mature industrial-strength parallel database systems, so we could imagine partitioning our input logs among a number of clients, each of which submits a stream of transactions.
However, the consistency and isolation guarantees provided by the database do not ensure proper sequencing of events.
At 1,000x acceleration events processed in one second could have occurred more than 15 minutes apart.
Close synchronization among clients can help, but unless we also control the commit order of concurrent transactions the outcome of accelerated replay could be quite different from that of real-time processing.&lt;/p&gt;

&lt;h1 id=&quot;serial-equivalent-parallel-processing&quot;&gt;Serial-equivalent parallel processing&lt;/h1&gt;

&lt;p&gt;What is needed for full-fidelity accelerated replay is “serial-equivalence.”
Suppose we are given a program that consumes input from a log, sometimes producing output and sometimes updating its internal state as it goes along. We want to achieve an identical output from a number of parallel programs, possibly communicating with one another, each consuming a partitioned portion of our original input log and producing outputs along with corresponding log sequence numbers.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/serial-equivalent-parallel-processing.png&quot; alt=&quot;Serial-equivalent parallel processing&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Serial-equivalent parallel processing is what developers attempting accelerated replay of real-time applications need.&lt;/p&gt;

&lt;h1 id=&quot;restream-and-multi-versioned-parallel-streaming&quot;&gt;ReStream and Multi-Versioned Parallel Streaming&lt;/h1&gt;

&lt;p&gt;Achieving serial-equivalent parallel processing relies on a key observation: causal dependencies between processing of log records are often sparse.
For example, messaging activity on a social network generally exhibits geographic clustering, so we can imagine an anti-spam system analyzing communication patterns within regions in parallel, coordinating across them when necessary.
Multi-Versioned Parallel Streaming (MVPS) provides the sort of loose coupling that allows computations to proceed in parallel, enforcing order when necessary, but not otherwise.&lt;/p&gt;

&lt;p&gt;A multi-versioned state store is a key building block for MVPS.
It presents the API of a key-value store, with the further requirement that all read and write operations must be accompanied by a timestamp.
Reads into the past are allowed, with the multi-versioned state returning the value of the write with the timestamp that most closely precedes the read timestamp.
Writes to the future are also allowed, but writing into the past is not; writing to a key at a timestamp less than the greatest read timestamp serviced at that key is an error.&lt;/p&gt;
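&lt;p&gt;These rules can be captured in a small sketch (my own illustrative Python, not ReStream’s implementation, which is distributed and exposes a different interface): each read records the greatest timestamp serviced per key, reads into the past return the most recent preceding write, and writes behind that read frontier are rejected.&lt;/p&gt;

```python
import bisect

class MultiVersionedStore:
    """Sketch of a multi-versioned key-value store with timestamped operations."""

    def __init__(self):
        self._versions = {}  # key: list of (timestamp, value), sorted by timestamp
        self._max_read = {}  # key: greatest read timestamp serviced so far

    def write(self, key, timestamp, value):
        # Writing into the past is an error: the write timestamp must not
        # precede the greatest read timestamp already serviced at this key.
        last_read = self._max_read.get(key, float("-inf"))
        if min(timestamp, last_read) == timestamp and timestamp != last_read:
            raise ValueError("write into the past at key %r" % (key,))
        versions = self._versions.setdefault(key, [])
        stamps = [ts for ts, _ in versions]
        versions.insert(bisect.bisect_right(stamps, timestamp), (timestamp, value))

    def read(self, key, timestamp):
        # Record the read frontier, then return the value of the write whose
        # timestamp most closely precedes (or equals) the read timestamp.
        self._max_read[key] = max(self._max_read.get(key, float("-inf")), timestamp)
        versions = self._versions.get(key, [])
        stamps = [ts for ts, _ in versions]
        i = bisect.bisect_right(stamps, timestamp)
        if i == 0:
            return None  # no write precedes this read
        return versions[i - 1][1]
```

&lt;p&gt;A real implementation would also need concurrency control and garbage collection of versions too old to ever be read again.&lt;/p&gt;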

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/multi-versioned-state.png&quot; alt=&quot;Multi-versioned state&quot; /&gt;&lt;/p&gt;

&lt;p&gt;MVPS combines multi-versioned state with program analysis and a runtime that ensures that writes always precede reads at a given key.
The example that follows serves to illustrate how it does this.&lt;/p&gt;

&lt;h2 id=&quot;anti-spam-example&quot;&gt;Anti-spam example&lt;/h2&gt;

&lt;p&gt;As a simple example, consider a multi-part rule for a social network anti-spam system.
We flag a message as possible spam when two conditions hold: 1) the sender has been sending to non-friends more than to friends by a factor of 2:1, and 2) the message originates from an IP address for which more than 20% of messages sent have contained an e-mail address.&lt;/p&gt;
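&lt;p&gt;Stated as a plain predicate over the relevant counters (an illustrative Python restatement of the rule, not the reactive program itself; the counter names are my own):&lt;/p&gt;

```python
import operator

def flag_as_spam(friend_sends, non_friend_sends, ip_messages, ip_with_email):
    # Condition 1: the sender has sent to non-friends more than to friends
    # by a factor of 2:1. (operator.gt(a, b) is "a greater than b".)
    skewed_sender = operator.gt(non_friend_sends, 2 * friend_sends)
    # Condition 2: more than 20% of the messages sent from this IP address
    # contained an e-mail address (5 * with_email greater than total).
    suspect_ip = operator.gt(5 * ip_with_email, ip_messages)
    return skewed_sender and suspect_ip
```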

&lt;p&gt;The following code shows a reactive program implementing this algorithm using a simplified Scala syntax:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/anti-spam-code.png&quot; alt=&quot;Anti-spam example code&quot; /&gt;&lt;/p&gt;

&lt;p&gt;ReStream performs static analysis on the program to obtain a graph of potential dependencies between reactive event handlers, mapping out the multi-versioned state variables that might be written to or read from by each.
Commonly these form a Directed Acyclic Graph (DAG), a key requirement that enables accelerated replay.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/mvps-example-dag.png&quot; alt=&quot;Example DAG&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Topological_sorting&quot;&gt;topological sort&lt;/a&gt; reveals the DAG structure in the program and allows ReStream to provide partitioned parallelism.
What’s most important is that ReStream never writes into the past of a multi-versioned state variable, so it’s fine to create multiple parallel instances of all operators &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt;, and &lt;code&gt;D&lt;/code&gt;.
Individual instances of &lt;code&gt;A&lt;/code&gt; need not coordinate amongst themselves and can process partitioned input in parallel.
The constraint is that no instance of &lt;code&gt;B&lt;/code&gt; gets ahead of any instance of &lt;code&gt;A&lt;/code&gt; in processing the input log, since that could cause writes into the past at the &lt;code&gt;friendships&lt;/code&gt; multi-versioned state.
Similar relationships hold for the other state variables.
ReStream implements a barrier mechanism to enforce the invariants that guarantee safe writes to multi-versioned state objects.&lt;/p&gt;
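&lt;p&gt;The barrier invariant reduces to a simple rule, sketched here in Python (an illustration of the invariant, not ReStream’s actual mechanism; function and variable names are my own): a downstream operator instance may only process the log up to the minimum progress of all instances of its upstream operators.&lt;/p&gt;

```python
def safe_frontier(progress, upstream_ops):
    """Greatest log sequence number a downstream operator instance may
    safely process. progress maps each operator name to the per-instance
    list of highest sequence numbers processed so far; the downstream
    instance must not get ahead of the slowest instance of any upstream
    operator, hence the minimum over all of them."""
    return min(min(progress[op]) for op in upstream_ops)
```

&lt;p&gt;In ReStream a centralized driver plays this role, collecting per-instance progress and advancing barriers, with mini-batches amortizing the coordination cost.&lt;/p&gt;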

&lt;p&gt;The diagram below illustrates ReStream in action with a partitioned log given as input.
There are two copies of each operator, one of which receives odd-numbered events and the other of which receives even-numbered events.
So long as events flow to operators in accordance with the computation’s DAG structure multi-versioned state ensures serial-equivalent results.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/multi-versioned-parallel-streaming.png&quot; alt=&quot;ReStream Scaling Equation&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;architecture&quot;&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The ReStream architecture looks similar to many other systems with distributed parallelism.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/restream-architecture.png&quot; alt=&quot;ReStream Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A centralized driver program collects progress from operators on each node and adjusts barriers as they advance.
Multi-versioned state objects are distributed across worker nodes, but are globally accessible via reads and writes originating at any node.
Each worker also receives a portion of the log.
Note that no special input partitioning is required; we can distribute log entries arbitrarily.
We do, however, require ordering within each partition as well as a global total order (i.e., a global log sequence number).&lt;/p&gt;

&lt;p&gt;ReStream uses mini-batches to help amortize various overheads.
These include the cost of coordinating progress at the driver program, as well as the latencies of reads and writes to multi-versioned state.&lt;/p&gt;

&lt;h1 id=&quot;evaluation-and-limits-to-scalability&quot;&gt;Evaluation and limits to scalability&lt;/h1&gt;

&lt;p&gt;ReStream shows good scalability as the number of machines grows, with throughput increasing by roughly 70% for each hardware doubling.
Importantly, ReStream quickly outperforms an efficient single-threaded replay implementation, readily justifying the overheads of distributed computation and of maintaining multi-versioned state.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/restream-scaling.png&quot; alt=&quot;ReStream Scaling&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What limits ReStream’s scalability?
We often think about the parallelism inherent in a program, but for data-driven applications the input matters as well.
It is the interplay of input data and program execution that generates reads and writes to shared state, and it is these reads and writes that represent the causal dependencies that limit parallelism.&lt;/p&gt;

&lt;p&gt;The graph below shows the throughput of ReStream running an anti-spam task on synthetic social network data.
The power law scale parameter &lt;em&gt;α&lt;/em&gt; describes the degree distribution of activity on the network, which varies across a range of practically relevant values.
As activity becomes more concentrated around fewer users, as might be the case with celebrities on Twitter, ReStream’s ability to provide parallelism diminishes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/restream-scaling-limits.png&quot; alt=&quot;ReStream Scaling Limits&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The solid lines represent a simple but effective model for ReStream’s throughput given by the formula below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/restream-mvps/restream-scaling-equation.png&quot; alt=&quot;ReStream Scaling Equation&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here the term &lt;em&gt;(hosts-1)/hosts&lt;/em&gt; models the cost of non-local communication.
A specially instrumented version of ReStream logs all reads and writes to multi-versioned state.
This model shows that when the critical path length through read-write dependencies exceeds a multiple of the per-host batch size, gains to performance are limited.
This is represented by the term &lt;em&gt;max(critical path length / per-host batch size, a)&lt;/em&gt;.
The model fits the measurements with &lt;em&gt;a&lt;/em&gt;=1.85 and a quality of fit &lt;em&gt;R&lt;sup&gt;2&lt;/sup&gt;&lt;/em&gt;=0.94 across the range of per-host batch sizes from 2,500 to 40,000.&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;p&gt;ReStream introduces Multi-Versioned Parallel Streaming (MVPS), a way of maintaining causal consistency with loose coupling.
MVPS enables serial-equivalent parallel replay, a key need for backtesting of real-time applications, which can demand processing months of logs in hours or even minutes.&lt;/p&gt;

&lt;p&gt;The traditional goal in stream processing is low latency.
To my knowledge, ReStream is the first streaming system developed with the principal aim of high throughput.
The capability for much-faster-than-real-time streaming can also be used to generate fine-grained training data for machine learning applications, or in scaling online stream processing.&lt;/p&gt;

&lt;p&gt;The present implementation of ReStream is a research prototype.
I invite anyone interested in building a production-ready MVPS implementation to contact me.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This work was supported in part by AWS Cloud Credits for Research.&lt;/em&gt;&lt;/p&gt;

</description>
        <pubDate>Sun, 20 Nov 2016 07:00:00 -0800</pubDate>
        <link>https://johann.schleier-smith.com/blog/2016/11/20/restream-multi-versioned-parallel-streaming.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2016/11/20/restream-multi-versioned-parallel-streaming.html</guid>
        
        
      </item>
    
      <item>
        <title>Event history architecture</title>
        <description>&lt;p&gt;It’s been a over year since I &lt;a href=&quot;/blog/2015/08/09/need-for-agile-machine-learning.html&quot;&gt;wrote here about Agile machine learning&lt;/a&gt;, work presented at KDD 2015 and based on my experiences at if(we).
Looking back with the benefit of a bit of distance and looking ahead to problems of coming years, I’m compelled to comment on one aspect of the work that seems particularly relevant to the needs of future applications: the &lt;strong&gt;event history data architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One of the keys to success in creating quick development cycles in machine learning is &lt;strong&gt;changing how we think about state&lt;/strong&gt;.
It may be most natural and common to update our databases and services to reflect the current “now” state of the world.
This representation is important for real-time systems, which need to produce results in a fraction of a second using the most current facts available, but it poses other challenges.
Training and backtesting models requires a sort of “what if” scenario analysis, one that uses historical data in new and different ways.
Most update-oriented approaches to state require separate tracking or perhaps snapshots to reconstruct past versions of their state.
Even then, supporting new features may be challenging.&lt;/p&gt;

&lt;p&gt;A better approach is to log events or facts as they happen, then to compute any other necessary state by processing this log.&lt;/p&gt;

&lt;p&gt;There’s not much to the basic &lt;em&gt;Event&lt;/em&gt; interface:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface Event {
  timestamp: Long
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One can think of this as a reductionist approach—&lt;strong&gt;just write down everything that happens, as it happens&lt;/strong&gt;, and you’re guaranteed to be able to compute anything you might fancy in the future.
A shared log is the central element of the event history architecture.&lt;/p&gt;

&lt;p&gt;The following figure shows how both production and development systems can draw from the same log, the content of the &lt;em&gt;Event History Repository&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/event-history-architecture/event_history_architecture.png&quot; alt=&quot;Event History Architecture&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The same code can run in the production environment’s &lt;em&gt;Real-Time State Updates&lt;/em&gt; and the development environment’s &lt;em&gt;State Updates&lt;/em&gt; boxes.
In production, state updates occur in services that power the live system, providing input for ranking and recommendations, or just simply data for display.
In development, state updates occur in a simulation context, making it possible to revisit past events to generate inputs for training or evaluating models.
The same approach also can be used to generate reports for inspection by developers or analysts.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;key to this unification&lt;/strong&gt; is the simple &lt;em&gt;Event History API&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface EventHistory {
	def publishEvent(e: Event)

	def getEvents(
		startTime: Date,
		endTime: Date,
		eventFilter: EventFilter,
		eventHandler: EventHandler)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The second method, when called with a finite end-time, is suitable for revisiting activity from a past time period.
When called with +∞ as end-time, it streams data to the event handler, which is just what is needed for real-time production deployments.&lt;/p&gt;
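&lt;p&gt;A minimal in-memory sketch of this API (my own illustrative Python; the method and field names are assumptions, and a real system would persist the log and stream indefinitely rather than iterate over a list):&lt;/p&gt;

```python
import operator

class InMemoryEventHistory:
    """Toy event history: an append-only log with time-range replay."""

    def __init__(self):
        self._log = []  # events in publication order, each with a "timestamp"

    def publish_event(self, event):
        self._log.append(event)

    def get_events(self, start_time, end_time, event_filter, event_handler):
        # Deliver events whose timestamps fall in the half-open range
        # [start_time, end_time); passing float("inf") as end_time replays
        # everything published, mirroring the streaming case described above.
        for event in self._log:
            ts = event["timestamp"]
            if operator.le(start_time, ts) and operator.lt(ts, end_time):
                if event_filter(event):
                    event_handler(event)
```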

&lt;p&gt;A few months ago Netflix published a detailed blog post describing &lt;a href=&quot;http://techblog.netflix.com/2016/02/distributed-time-travel-for-feature.html&quot;&gt;Distributed Time Travel for Feature Generation&lt;/a&gt;.
They highlight the value of real-time recommendations and list a concise set of requirements for developing the necessary models.
Among these, they desire a system that “accurately represents input data for a model at a point in time to simulate online use.”
Netflix developed a robust mechanism for recording snapshots of state from across the numerous services that provide input to their recommendations.
As an upgrade to established infrastructure their solution makes sense, but an event history architecture could be a good future direction.&lt;/p&gt;

&lt;p&gt;Implementing an event history architecture today is challenging because platform support for the API is limited.
However, there are some technologies that can show us the way:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://geteventstore.com/&quot;&gt;Event Store&lt;/a&gt; - perhaps the only database designed from the ground up to ingest state as a series of events.
Its lead developer is Greg Young, who has been a proponent of event processing as a central application abstraction.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.confluent.io/&quot;&gt;Confluent&lt;/a&gt; - a commercial extension of &lt;a href=&quot;https://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt;. Kafka was created by Jay Kreps, author of &lt;a href=&quot;https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;The Log: What every software engineer should know about real-time data’s unifying abstraction&lt;/a&gt;, one of my favorite blog posts.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ifwe.github.io/antelope/&quot;&gt;Antelope Realtime Events&lt;/a&gt; - my demo implementation of tools for feature engineering and for building real-time recommendations using an event history API.
This is based on work done at if(we).
It is not a production implementation and I am no longer developing it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would be thrilled to see a really robust and practical implementation of an event history database.
One can build this today by combining existing streaming and data storage (e.g., Kafka + HDFS), but a purpose-built system would be better still, one that implements the event history API directly.&lt;/p&gt;

&lt;p&gt;There’s also &lt;strong&gt;plenty of room for interesting research&lt;/strong&gt;.
For example, I’m sure we can learn a lot from trying to apply event history in the context of various machine learning techniques, including deep learning.
Another intriguing problem is how to take a log and roll it forward quickly and efficiently, either generating intermediates or jumping ahead to the present, as when creating a new view of history.
I have done some work in this area recently, which only convinces me more that this will be a fruitful direction.&lt;/p&gt;

&lt;p&gt;The production needs of real-time machine learning systems call for a streaming, incremental, event-based approach to processing data.
Capturing these events in a log and replaying them brings twin benefits: allowing shared code between production and development and providing a time-travel capability.
The event history API provides a unified point of access to streaming events, both real-time and historical, and deserves to serve as the foundation for modern data-driven intelligent systems.&lt;/p&gt;

&lt;p&gt;I’d love to hear about it if you’ve implemented this approach or if you’re considering something similar.&lt;/p&gt;

&lt;p&gt;Related reading:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.kappa-architecture.com&quot;&gt;Kappa Architecture&lt;/a&gt; - all data is represented as an append-only log and all data processing is stream processing.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://msdn.microsoft.com/en-us/library/dn589792.aspx&quot;&gt;Event Sourcing&lt;/a&gt; and &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/dn568103.aspx&quot;&gt;CQRS&lt;/a&gt; - design patterns known in the world of Microsoft / C#.
Applications are built on top of a log-oriented data store rather than a mutable state abstraction.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Wed, 31 Aug 2016 07:00:00 -0700</pubDate>
        <link>https://johann.schleier-smith.com/blog/2016/08/31/event-history-architecture.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2016/08/31/event-history-architecture.html</guid>
        
        
      </item>
    
      <item>
        <title>Five questions to ask when starting a company</title>
        <description>&lt;p&gt;In 2004 my co-founder, our small team, and I launched a product in a new category, one that would soon become widely known as social networking.
While the company we started, if(we), was eventually far outpaced by Facebook, which launched only months before, we have a lot to be proud of: We served hundreds of millions of members on our products, first on Tagged and later on hi5; we were the first social network to become profitable, a status that we have maintained since 2008; and we were early to a space that would see Silicon Valley’s biggest success of the decade.&lt;/p&gt;

&lt;p&gt;How did we know to go all-in on building a social network, early on, before the promise became apparent?
There is no sure-fire answer for finding “the next big thing,” but the questions below reflect, with some benefit from hindsight, the analysis my co-founder and I did before dropping out of the Physics Ph.D. program at Stanford.
It is not an exhaustive list, and we should never discount factors such as luck, but I believe it reflects some important criteria that can indicate when a startup is set up to capture a massive opportunity.&lt;/p&gt;

&lt;h2 id=&quot;five-questions-to-ask-when-starting-a-company&quot;&gt;Five questions to ask when starting a company&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;What are the trends?&lt;/strong&gt; -
For us, the key trend in 2004 was more eyeballs on the internet and fewer on television.
We concluded that advertising revenues must eventually follow.
In addition to this multi-decade trend, there was the digital photography revolution, which made putting photos online much easier.
It’s important to look for both short-term and long-term trends.
The former create opportunities and are critical for initial traction, whereas the latter sustain growth over time.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;What are the network effects?&lt;/strong&gt; -
The best social network is the one that your friends are on. Communication tools, marketplaces, ridesharing products, platforms, and businesses of all sorts benefit in varying degree from this form of increasing returns to scale.
Network effects often create a winner-takes-all dynamic, or something close to it, and many of today’s most lucrative businesses profit because of strong network effects.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;What are the pain points?&lt;/strong&gt; -
Finding old friends, sharing fresh photos, giving and receiving love and validation, these are all human problems that social networks address.
Skype built its success on saving people money on international calls, WhatsApp on saving SMS fees. Uber solved the often acute challenge of finding a cab.
Occasionally great companies come about from providing for needs that people didn’t know they had, but more often there is some really obvious problem that they solve.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Where is the magic?&lt;/strong&gt; -
Magic is that which defies belief, that which at first seems impossible, but which we can quickly come to take for granted.
Consider entering a few words into a simple web form, hitting the search button, and getting the result you’re looking for, in a fraction of a second, on the first try, and from among billions of possibilities; what could be more magical?
Keeping up with all of your family and friends on a daily basis?
Calling half-way around the world for free?
Watching a car come to pick you up and knowing exactly when it will be there?
Magic reshapes our view of the world.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Why us?&lt;/strong&gt; -
Do you and your team members have the right skills to win in this space? Do you care enough about the problem to make the sacrifices of startup life, to push past the inevitable low points? Will you, when matched against talented competition, end up the winners?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you don’t have awesome answers to all five of the questions above, don’t despair.
For example, if network effects are weak then perhaps intellectual property can protect against competitors.
Plenty of good businesses can answer only three or four of these questions well, and others lack answers early-on but discover them over time.
The more compelling your answers are, though, the better your chances are of starting something really, really big.&lt;/p&gt;
</description>
        <pubDate>Sun, 10 Jan 2016 21:00:00 -0800</pubDate>
        <link>https://johann.schleier-smith.com/blog/2016/01/10/five-questions-to-ask-when-starting-a-company.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2016/01/10/five-questions-to-ask-when-starting-a-company.html</guid>
        
        
      </item>
    
      <item>
        <title>Analyzing a read-only transaction anomaly under snapshot isolation</title>
        <description>&lt;p&gt;Every now and then I learn about something so surprising, counter-intuitive, or intriguing, that I just have to see it for myself.
The 2004 SIGMOD paper by Alan Fekete, Elizabeth O’Neil, and Patrick O’Neil, “&lt;a href=&quot;http://www.cs.umb.edu/~poneil/ROAnom.pdf&quot;&gt;A read-only transaction anomaly under snapshot isolation&lt;/a&gt;,” ticked all of those boxes for me.&lt;/p&gt;

&lt;p&gt;In what follows, I’ll dissect this research finding, describe my efforts to reproduce it in popular databases, and reflect on how it informs practical decisions.&lt;/p&gt;

&lt;h1 id=&quot;snapshot-isolation&quot;&gt;Snapshot isolation&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Snapshot isolation&lt;/em&gt; is perhaps the most successful approach to providing concurrency with consistency guarantees in relational databases.
The idea is simple: once a transaction starts it sees a view reflecting all updates committed to the database before it began, but none of the effects of other in-progress transactions, or of other transactions that commit later on in the course of its execution.
Snapshot isolation is the dominant implementation of multiversion concurrency control (MVCC), and most modern high-performance databases rely upon some form of it (e.g., Oracle, DB2, PostgreSQL, Microsoft SQL Server).
It can eliminate much of the lock contention that otherwise plagues databases, answering reads instantly with non-blocking execution and providing consistent views of the database to long-running read-only transactions, even as updates and queries of new data continue to come in.&lt;/p&gt;

&lt;p&gt;How surprising to learn, then, that read-only queries under snapshot isolation can sometimes return irreconcilable results?
How surprising to realize that such read-only anomalies can even occur when the database exhibits &lt;em&gt;serializable&lt;/em&gt; update behavior, meaning that concurrent transaction execution always results in a final state consistent with some sequential ordering of transactions?&lt;/p&gt;

&lt;h1 id=&quot;example&quot;&gt;Example&lt;/h1&gt;

&lt;p&gt;The example provided by Fekete, &lt;em&gt;et al.&lt;/em&gt; involves a bank at which the customer has both a checking account and a savings account. If a withdrawal leads to a negative combined balance, then the bank assesses an overdraft fee.&lt;/p&gt;

&lt;p&gt;In the example, both &lt;em&gt;checking&lt;/em&gt; and &lt;em&gt;savings&lt;/em&gt; accounts start with balance 0, and the following concurrent transactions ensue:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Txn 1&lt;/em&gt;: Add 20 to &lt;em&gt;savings&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Txn 2&lt;/em&gt;: Subtract 10 from &lt;em&gt;checking&lt;/em&gt;. If doing so causes (&lt;em&gt;checking + savings&lt;/em&gt;) to become negative then additionally subtract 1 as an overdraft fee.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Txn 3&lt;/em&gt;: Read balances (&lt;em&gt;checking&lt;/em&gt;, &lt;em&gt;savings&lt;/em&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the deposit (&lt;em&gt;Txn 1&lt;/em&gt;) comes ahead of the withdrawal (&lt;em&gt;Txn 2&lt;/em&gt;) then the final combined balance is 10, whereas if the database processes these transactions in the opposite order then the final combined balance is 9.&lt;/p&gt;

&lt;p&gt;Surprisingly, it turns out that snapshot isolation allows &lt;em&gt;Txn 3&lt;/em&gt; to read (&lt;em&gt;checking&lt;/em&gt;, &lt;em&gt;savings&lt;/em&gt;) as (0, 20), even when the final combined balance ends up as 9. Even though the database ends up in a consistent state (corresponding to withdrawal before deposit), the output it produced in &lt;em&gt;Txn 3&lt;/em&gt; is &lt;em&gt;anomalous&lt;/em&gt;: it corresponds to no point-in-time snapshot of a transaction sequence that could have produced the final state!&lt;/p&gt;

&lt;p&gt;Fekete &lt;em&gt;et al.&lt;/em&gt; put this in context, saying the following:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Starting with [&lt;a href=&quot;http://research.microsoft.com/apps/pubs/default.aspx?id=69541&quot;&gt;BBGMOO95&lt;/a&gt;], it was assumed that read-only transactions always execute serializably, without ever needing to wait or abort because of concurrent update transactions.
This seemed self-evident because all reads take place at an instant of time, when all committed transactions have completed their writes and no writes of non-committed transactions are visible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The implication is that even read-only transactions might need to choose between yielding consistent results and yielding timely results.&lt;/p&gt;

&lt;p&gt;By the time this read-only anomaly was revealed in 2004, Oracle’s annual revenue had grown beyond $10 billion and IDC estimated the database software market at $15 billion. It’s still striking today to realize 1) that nobody had recognized earlier that such foundational assumptions were wrong, and 2) that this didn’t seem to matter.&lt;/p&gt;

&lt;h1 id=&quot;how-the-anomaly-occurs&quot;&gt;How the anomaly occurs&lt;/h1&gt;

&lt;p&gt;The table below shows a sequence of reads, writes, and commits that illustrates anomalous behavior under snapshot isolation.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;em&gt;Txn 1&lt;/em&gt;&lt;/th&gt;
      &lt;th&gt;&lt;em&gt;Txn 2&lt;/em&gt;&lt;/th&gt;
      &lt;th&gt;&lt;em&gt;Txn 3&lt;/em&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;R(&lt;em&gt;checking&lt;/em&gt;) → 0&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;R(&lt;em&gt;savings&lt;/em&gt;) → 0&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;R(&lt;em&gt;savings&lt;/em&gt;) → 0&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;W(&lt;em&gt;savings&lt;/em&gt;) ← 20&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Commit&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;R(&lt;em&gt;checking&lt;/em&gt;) → 0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;R(&lt;em&gt;savings&lt;/em&gt;) → 20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;Commit&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;W(&lt;em&gt;checking&lt;/em&gt;) ← -11&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;Commit&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Since &lt;em&gt;Txn 3&lt;/em&gt; starts to read after &lt;em&gt;Txn 1&lt;/em&gt; commits, it sees the deposit to &lt;em&gt;savings&lt;/em&gt;. However, because &lt;em&gt;Txn 2&lt;/em&gt; started before &lt;em&gt;Txn 1&lt;/em&gt; committed, it does not see the deposit, and so assesses the overdraft fee to &lt;em&gt;checking&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Txn 1&lt;/em&gt; is concurrent with &lt;em&gt;Txn 2&lt;/em&gt;, and the final result is consistent with the serial ordering &lt;em&gt;Txn 2&lt;/em&gt; before &lt;em&gt;Txn 1&lt;/em&gt;. However, the output of &lt;em&gt;Txn 3&lt;/em&gt; (concurrent with &lt;em&gt;Txn 2&lt;/em&gt; but not &lt;em&gt;Txn 1&lt;/em&gt;) is consistent with the opposite serial ordering, &lt;em&gt;Txn 1&lt;/em&gt; before &lt;em&gt;Txn 2&lt;/em&gt;. At the end of the day, the output of &lt;em&gt;Txn 3&lt;/em&gt; cannot be reconciled with any serial order producing the final state.&lt;/p&gt;
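&lt;p&gt;The schedule above can be reproduced with a toy multiversion store: each transaction reads from the snapshot defined by its start timestamp and publishes its writes at commit. This is a simplified model for illustration only, not how any of the databases discussed here actually implements MVCC.&lt;/p&gt;

```python
class MVCCStore:
    """Toy multiversion store: each key maps to a list of
    (commit_ts, value) versions; a transaction reads the newest
    version committed at or before its start timestamp."""

    def __init__(self, initial):
        self.clock = 0
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        self.clock += 1
        return {"start": self.clock, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:        # read your own writes
            return txn["writes"][key]
        # newest version visible to this transaction's snapshot
        return max(cv for cv in self.versions[key] if cv[0] <= txn["start"])[1]

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions[key].append((self.clock, value))

db = MVCCStore({"checking": 0, "savings": 0})

t2 = db.begin()                      # Txn 2: withdrawal, starts first
assert db.read(t2, "checking") == 0
assert db.read(t2, "savings") == 0

t1 = db.begin()                      # Txn 1: deposit 20 to savings
db.write(t1, "savings", db.read(t1, "savings") + 20)
db.commit(t1)

t3 = db.begin()                      # Txn 3: read-only balance check
snapshot = (db.read(t3, "checking"), db.read(t3, "savings"))

# Txn 2's snapshot predates Txn 1's commit, so it sees a combined
# balance of -10 after withdrawing 10 and assesses the fee of 1.
db.write(t2, "checking", -11)
db.commit(t2)

final = (max(db.versions["checking"])[1], max(db.versions["savings"])[1])
# snapshot == (0, 20): consistent only with Txn 1 before Txn 2;
# final == (-11, 20): combined balance 9, i.e. Txn 2 before Txn 1.
```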

&lt;h1 id=&quot;experiments&quot;&gt;Experiments&lt;/h1&gt;

&lt;p&gt;I was so surprised to learn of this anomaly that I had to see it myself. Would databases popular in 2015 exhibit this behavior? If so, how frequently would it occur? A simple &lt;a href=&quot;https://github.com/jssmith/stressisolation&quot;&gt;concurrency stress test&lt;/a&gt; provides the answers.&lt;/p&gt;

&lt;h3 id=&quot;anomalies-per-10000-tests&quot;&gt;Anomalies per 10,000 tests&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Database&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;READ COMMITTED&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;SERIALIZABLE&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;DB2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MySQL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Oracle&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;130&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;138&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;PostgreSQL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;174&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.017&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;With the transaction isolation level configured as READ COMMITTED we expect to see some anomalous reads, and that’s OK, as we have not asked the database to operate in a way consistent with a global total order. In practice anomalies occur between 1% and 2% of the time for Oracle and PostgreSQL, and a bit less often for DB2.&lt;/p&gt;

&lt;p&gt;Oracle’s behavior under SERIALIZABLE transaction isolation level is roughly the same as under READ COMMITTED isolation, suggesting that this setting does not address the causes of read anomalies (I do see other differences in behavior, with some queries rejected under SERIALIZABLE isolation).
PostgreSQL does much better, producing read-only anomalies at a rate on the order of one per million, and even then only when the test program retries transactions that initially error out due to contention. DB2 seems immune to anomalies when configured for serializability.&lt;/p&gt;

&lt;p&gt;MySQL, as configured here, uses table-level locks and does not provide support for MVCC or snapshot isolation. This simple approach proves to be reliably free of anomalies.&lt;/p&gt;

&lt;p&gt;I collected these results on Amazon EC2 using two c4.2xlarge machines, one to host the database servers and another to run the client stress test program.
The data above are for 50 database connections with autocommit disabled (interestingly, enabling autocommit can reduce read-only anomalies, but introduces additional anomalies on Oracle, including incorrect updates).
I welcome you to review &lt;a href=&quot;https://github.com/jssmith/stressisolation&quot;&gt;the code&lt;/a&gt;, and perhaps to contribute additional examples or support for other databases.&lt;/p&gt;

&lt;h1 id=&quot;so-what&quot;&gt;So what?&lt;/h1&gt;

&lt;p&gt;This study helped me understand a few things about concurrency:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Concurrency &lt;em&gt;really&lt;/em&gt; means that things are happening at the &lt;em&gt;same time&lt;/em&gt;. So long as things are happening at the same time, we don’t know what comes first, and to provide consistency we must first resolve transaction order.&lt;/li&gt;
  &lt;li&gt;Reads can behave like writes. More specifically, consistent reads demand that the database fix its state (just as &lt;a href=&quot;https://en.wikipedia.org/wiki/Wave_function_collapse&quot;&gt;measurement demands that a physical system fix its quantum state&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also some practical implications:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Simple can be best. While Oracle, DB2, and PostgreSQL provide sophisticated concurrency capabilities, it is MySQL, with its simple locking, that provides consistency reliably.&lt;/li&gt;
  &lt;li&gt;Snapshot isolation, MVCC, and even SERIALIZABLE transactions provide only limited protection against anomalies, and behavior varies between databases.&lt;/li&gt;
  &lt;li&gt;When correctness and consistency matter, you really need to verify your application by testing it under concurrency stress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since 2004 some solid research has gone into eliminating anomalies from databases operating with snapshot isolation, not only the read-only anomalies described here but others too. Some approaches modify application SQL, others modify the database’s consistency guarantees (see, e.g., the &lt;a href=&quot;http://ses.library.usyd.edu.au/bitstream/2123/5353/1/michael-cahill-2009-thesis.pdf&quot;&gt;Ph.D. thesis by Michael Cahill&lt;/a&gt;). Intuitive consistency models make life easier for programmers, and while the tests described here show that many databases popular today don’t yet incorporate fixes for snapshot isolation anomalies, it’s reassuring to know that solutions exist.&lt;/p&gt;

&lt;h3 id=&quot;references-and-further-reading&quot;&gt;References and further reading&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.cs.umb.edu/~poneil/ROAnom.pdf&quot;&gt;A. Fekete, E. O’Neil, and P. O’Neil. A read-only transaction anomaly under snapshot isolation. &lt;em&gt;SIGMOD&lt;/em&gt;, 2004.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://research.microsoft.com/apps/pubs/default.aspx?id=69541&quot;&gt;H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O’Neil, and P. O’Neil. A critique of ANSI SQL isolation levels. &lt;em&gt;SIGMOD&lt;/em&gt;, 1995.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://ses.library.usyd.edu.au/bitstream/2123/5353/1/michael-cahill-2009-thesis.pdf&quot;&gt;M. J. Cahill, &lt;em&gt;Serializable Isolation for Snapshot Databases&lt;/em&gt;. Ph.D. thesis, University of Sydney, 2009.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.oracle.com/technetwork/issue-archive/2010/10-jan/o65asktom-082389.html&quot;&gt;T. Kyte. On Transaction Isolation Levels. &lt;em&gt;Oracle Magazine&lt;/em&gt;, November/December 2005.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.jmfaleiro.com/pubs/multiversion-vldb2015.pdf&quot;&gt;J. M. Faleiro and D. J. Abadi. Rethinking serializable multiversion concurrency control. &lt;em&gt;VLDB&lt;/em&gt;, 2015.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Wed, 06 Jan 2016 18:00:00 -0800</pubDate>
        <link>https://johann.schleier-smith.com/blog/2016/01/06/analyzing-a-read-only-transaction-anomaly-under-snapshot-isolation.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2016/01/06/analyzing-a-read-only-transaction-anomaly-under-snapshot-isolation.html</guid>
        
        
      </item>
    
      <item>
        <title>Update: One semester at Berkeley</title>
        <description>&lt;p&gt;It is now December and it has been four months since I last wrote here. It’s not that I haven’t had any thoughts worth sharing. To the contrary, the recent period has been among the most stimulating for me in years. It has also been one of the busiest.&lt;/p&gt;

&lt;p&gt;For me, starting a company was filled with weeks without weekends, and nights without sleep. Scaling the Tagged infrastructure was particularly challenging as things could break at any time, and consequently, I rarely got to set my own schedule.&lt;/p&gt;

&lt;p&gt;Now that I’m in school, pursuing an MS in Computer Science at Berkeley, the stakes are lower—if I perform poorly I miss out on a learning opportunity, or perhaps a grade, nothing like bringing business to a standstill and interrupting service to millions of members. Still, somewhat surprisingly, I have found myself working harder than I have in years, and with less control over my schedule. The courses I chose required reading 8 to 10 research papers in detail each week, presenting research papers every few weeks, developing a research contribution in the area of big data systems, and implementing, in several different ways, a virtual machine runtime for a simple language. My guess is that over the 18 weeks my output adds up to about 2,500 lines of Scala, 4,300 lines of C, 600 lines of x86 assembly, and perhaps 2,300 lines of English. In all I met over 50 deadlines, dates that I signed up for but that I didn’t get to set.&lt;/p&gt;

&lt;p&gt;Still, I’m happy to say that this first semester back in school has been thoroughly rewarding. Below I reflect on how my experience compares to what I might have anticipated when I last wrote here.&lt;/p&gt;

&lt;p&gt;Expectations confirmed:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Berkeley is a community of amazingly welcoming, friendly, smart, and curious people.&lt;/li&gt;
  &lt;li&gt;The faculty are the world’s best, and spending time with them is the surest route to expert insights, be that in class, collaborating on research, or during talks or meetings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reassuring realization:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I’m able to fit in with the rest of the students remarkably well. While most are coming straight from college and with strong computer science education, I have no formal training but ample industry experience. Somehow we end up at a similar level, and we’re all here to learn.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Biggest surprises:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;While at if(we) I almost always found myself to be the most technically versed person in the room, especially in the context of the infrastructure that we spent 10 years building. I have found it refreshing to be surrounded routinely by people who have greater technical expertise than I do, as well as to have the chance to be the “business person” in the room.&lt;/li&gt;
  &lt;li&gt;Computer science is probably more about writing in English than it is about writing in code. The main currency of the field is ideas, and the ability to communicate them is paramount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After one semester, would I recommend graduate school to other entrepreneurs? When the chance presents itself, I enthusiastically recommend taking a step back, finding a stimulating environment, and investing in yourself. Whether that means heading back to school, joining a VC firm as an entrepreneur in residence, taking on consulting projects, traveling the world, or just disconnecting will depend on you and your situation. For me, the MS program in Berkeley’s Electrical Engineering and Computer Science department is turning out to be just the right thing.&lt;/p&gt;

&lt;p&gt;Looking ahead to the spring, I’m planning a somewhat less intense schedule, which may mean more posts here.&lt;/p&gt;
</description>
        <pubDate>Tue, 22 Dec 2015 00:00:00 -0800</pubDate>
        <link>https://johann.schleier-smith.com/blog/2015/12/22/update-one-semester-at-berkeley.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2015/12/22/update-one-semester-at-berkeley.html</guid>
        
        
      </item>
    
      <item>
        <title>Why I&apos;m going back to graduate school</title>
        <description>&lt;p&gt;When I tell friends that I’m leaving my executive role at a company that I spent ten years building to go back to school, some react with surprise, some without any surprise at all.&lt;/p&gt;

&lt;p&gt;In Silicon Valley dropping out of school is a mark of pride for entrepreneurs, a mark that I earned myself back in 2004, when I left my Physics Ph.D. program at Stanford to build a business in social networking. My self-taught computer science background was not substantially limiting, and common wisdom says that most of what we learn in school is useless for real-world business challenges, or for the sort of software development used to build typical products.&lt;/p&gt;

&lt;p&gt;Still, here I am: today is my first day as a graduate student in the computer science M.S. program at Berkeley. Those who know me well will recognize that I’m excited to be here because I value systematic understanding, theory if you will—that’s just the way my brain works. I’m also excited because academic computer scientists are in a position to look further ahead than industry practitioners; in addition to intellectual fruits, so many foundational technologies with near-universal industrial use originated within academia. The &lt;a href=&quot;https://www.eecs.berkeley.edu/department/history.shtml&quot;&gt;Berkeley EECS department history&lt;/a&gt; only hints at the list.&lt;/p&gt;

&lt;p&gt;My hope is that I will come away from the program here with the solid computer science foundation that I’ve been inching towards reading textbooks on nights and weekends, as well as with the insights on current research that will carry into applications over the next decades. Also, I enjoy learning, so if nothing else this should be fun.&lt;/p&gt;

&lt;p&gt;Leaving the executive team at if(we) is a move I could only make knowing the company remains in good hands. Our management team includes veterans with an extensive mix of experience earned both working on our products and in related areas, and I look forward to continuing to work with this group in my capacity as a board member.&lt;/p&gt;
</description>
        <pubDate>Mon, 24 Aug 2015 15:00:00 -0700</pubDate>
        <link>https://johann.schleier-smith.com/blog/2015/08/24/johann-going-to-berkeley.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2015/08/24/johann-going-to-berkeley.html</guid>
        
        
      </item>
    
      <item>
        <title>The need for Agile machine learning</title>
        <description>&lt;p&gt;&lt;em&gt;This post outlines the motivation for the paper I am presenting at KDD this year, &lt;a href=&quot;/shared/schleier-smith-kdd-2015.pdf&quot;&gt;An Architecture for Agile Machine Learning in Real-Time Applications&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;2-weeks-6-months&quot;&gt;2 weeks « 6 months&lt;/h1&gt;

&lt;p&gt;Why is deploying applications based on machine learning often so slow? That’s the question we asked ourselves at &lt;a href=&quot;http://www.ifwe.co/&quot;&gt;if(we)&lt;/a&gt; following yet another 3-month cycle of product development, one that ended up showing small improvements, but one that left us questioning whether the whole effort was worthwhile.&lt;/p&gt;

&lt;p&gt;We had a strong team of talented data scientists and talented programmers, with ample and relevant experience earned at well-known companies and famous universities. We were good at building software, but the complex software development required to deploy even simple trained systems seemed grossly incommensurate with the relative simplicity of our models.&lt;/p&gt;

&lt;p&gt;We wanted to iterate rapidly, to go quickly from ideas for new personalized recommendations to trained models, to production experiments, to results that would test those ideas. In practice, model deployment presented tremendous obstacles in what otherwise promised to be a powerful innovation cycle:
&lt;img src=&quot;/img/blog/agile-machine-learning/cycle-white.png&quot; alt=&quot;Machine Learning Cycle&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In many ways, these experiences reminded us of problems that successful consumer internet developers had left behind a decade earlier. Whereas we had gotten good at shipping simple A/B experiments quickly, often building enhancements in just a few days and using our million-user scale to take statistically sound measurements just as quickly, our experience with machine learning stood in stark contrast, with daily releases replaced by 3- to 6-month software projects. In some cases we failed entirely, never managing to meet the requirements of scale and performance demanded by our environment, even after data scientists had demonstrated offline predictive algorithms that promised to improve the user experience.&lt;/p&gt;

&lt;p&gt;These were our needs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Product cycles measured in days not months&lt;/li&gt;
  &lt;li&gt;Scale to &amp;gt;10 million recommendation candidates&lt;/li&gt;
  &lt;li&gt;Responsiveness to current activity (updates within 10 seconds)&lt;/li&gt;
  &lt;li&gt;Instant query results for users (less than 1 second)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The platform we developed in response to these requirements routinely allows us to go from new ideas to results validated in production experiments in just two weeks. Fixing the model deployment step allowed us to fix the Agile development cycle for machine learning applications.&lt;/p&gt;

&lt;h1 id=&quot;data-makes-it-different&quot;&gt;Data makes it different&lt;/h1&gt;

&lt;p&gt;Why don’t traditional Agile software development practices translate to machine learning products? What makes such applications uniquely challenging to deploy? Well, to a large extent, they do translate. Version control, continuous integration, project planning methodologies, and more, all can help in straightforward ways.&lt;/p&gt;

&lt;p&gt;The key difference in machine learning is the &lt;em&gt;central role of data&lt;/em&gt;. Even a very simple feature, say one that incorporates product popularity into rankings, needs both to draw upon data collected before deployment and to incorporate future information as it becomes available. Disparate paths to data, one for the past and one for the future, introduce additional complexity, often to an extent that greatly exceeds that inherent in the product’s functionality.&lt;/p&gt;

&lt;p&gt;The key to alleviating deployment challenges is to provide a single path to data, one that puts past and future data behind the same abstraction, and one that is as practical in a production setting as at a data scientist’s desk.&lt;/p&gt;

&lt;h1 id=&quot;our-experience&quot;&gt;Our experience&lt;/h1&gt;

&lt;p&gt;Meet Me is the primary dating feature of the social platform powering &lt;a href=&quot;http://www.hi5.com/&quot;&gt;hi5&lt;/a&gt; and &lt;a href=&quot;http://www.tagged.com/&quot;&gt;Tagged&lt;/a&gt;. The interface shows one profile at a time, allowing users to tap or swipe, voting &lt;em&gt;yes&lt;/em&gt; or &lt;em&gt;no&lt;/em&gt; to indicate their interest in connecting. When the attraction is mutual, when both users vote yes on one another, we call this a &lt;em&gt;match&lt;/em&gt; and conversation can ensue. Below is a screenshot of the Tagged Android application.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/agile-machine-learning/meetme_screenshot.png&quot; alt=&quot;Meet Me Android Application&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Past social interactions are a potent predictor of future interactions, so using machine learning techniques was a natural evolution after heuristic attempts to improve matching showed diminishing returns.&lt;/p&gt;

&lt;p&gt;After experiencing deployment challenges as a consequence of disparate paths to data, we re-architected our predictive platform using a design that we call “event history architecture.” In this approach, the only route to data is through a time-ordered event history that exposes a simple interface:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;interface EventHistory {
	def publishEvent(e: Event)

	def getEvents(
		startTime: Date,
		endTime: Date,
		eventFilter: EventFilter,
		eventHandler: EventHandler)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Setting &lt;em&gt;endTime&lt;/em&gt; to +∞ in the &lt;em&gt;EventHistory&lt;/em&gt; interface results in real-time streaming, so the interface provides access to both historical and future data in one call to &lt;em&gt;getEvents&lt;/em&gt;.&lt;/p&gt;
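&lt;p&gt;The semantics can be illustrated with a toy in-memory sketch (in Python; illustrative only, not our production implementation): a single store where &lt;em&gt;getEvents&lt;/em&gt; replays stored history and, when &lt;em&gt;endTime&lt;/em&gt; is unbounded, keeps the handler subscribed for events that arrive later.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import bisect
import math

class InMemoryEventHistory:
    """Toy sketch of the EventHistory idea: one call path serves
    both stored history and live updates. Not the production system."""

    def __init__(self):
        self._times = []        # event timestamps, kept sorted
        self._events = []       # (time, event) pairs, time-ordered
        self._subscribers = []  # (event_filter, event_handler) pairs

    def publish_event(self, time, event):
        index = bisect.bisect_right(self._times, time)
        self._times.insert(index, time)
        self._events.insert(index, (time, event))
        for event_filter, event_handler in self._subscribers:
            if event_filter(event):
                event_handler(event)

    def get_events(self, start_time, end_time, event_filter, event_handler):
        # Replay the stored past within [start_time, end_time] ...
        lo = bisect.bisect_left(self._times, start_time)
        hi = bisect.bisect_right(self._times, end_time)
        for _, event in self._events[lo:hi]:
            if event_filter(event):
                event_handler(event)
        # ... and with end_time unbounded, stay subscribed for the future.
        if end_time == math.inf:
            self._subscribers.append((event_filter, event_handler))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A single call with an infinite &lt;em&gt;endTime&lt;/em&gt; first replays history through the handler, then continues delivering each later published event, mirroring the replay-then-stream behavior described above.&lt;/p&gt;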

&lt;p&gt;Since raw events are not suitable input for our machine learning algorithms, we invested in a feature engineering language that makes it easy to transform an event history into meaningful measures of user interaction. For example, a simple feature might maintain statistics on the ratio of yes to no votes for a particular user, perhaps broken down by age range, a reflection of the individual’s preferences. Our environment allows a great deal of flexibility in such transformations.&lt;/p&gt;
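&lt;p&gt;As an illustration (a hypothetical Python sketch, not our actual feature language, with invented event fields), the yes-vote ratio broken down by the candidate’s age range can be expressed as a simple fold over a time-ordered event stream:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import defaultdict

def age_bucket(age):
    # Hypothetical bucketing scheme for illustration.
    if age in range(18, 25):
        return "18-24"
    if age in range(25, 35):
        return "25-34"
    return "35-plus"

def vote_ratio_features(events):
    """Fold a time-ordered stream of vote events into per-voter features:
    the fraction of yes votes, keyed by the candidate's age bucket."""
    counts = defaultdict(lambda: {"yes": 0, "total": 0})
    for e in events:
        key = (e["voter"], age_bucket(e["candidate_age"]))
        counts[key]["total"] += 1
        if e["vote"] == "yes":
            counts[key]["yes"] += 1
    return {k: v["yes"] / v["total"] for k, v in counts.items()}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the fold consumes one event at a time, the same code can run over replayed history or over a live stream, which is exactly what the single path to data makes possible.&lt;/p&gt;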

&lt;p&gt;With model features derived from event history, we directly deploy to production the same feature engineering code used by data scientists in development. The cycle of software development that used to intercede between model training and production deployment is now replaced with a few simple steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Deploy feature engineering code (identical to code used during development)&lt;/li&gt;
  &lt;li&gt;Deploy model parameters obtained from training&lt;/li&gt;
  &lt;li&gt;Hit go: replay history with the new feature engineering code, then continue with real-time updates once the system catches up to the present.&lt;/li&gt;
&lt;/ol&gt;
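&lt;p&gt;The final “hit go” step can be sketched as follows (an illustrative Python toy; &lt;em&gt;live_queue&lt;/em&gt; and &lt;em&gt;feature_update&lt;/em&gt; are stand-ins for the real system’s streaming source and deployed feature engineering code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import queue

def hit_go(past_events, live_queue, feature_update):
    """Toy sketch of step 3: replay stored history through the newly
    deployed feature code, then keep consuming live events."""
    for event in past_events:       # replay the past
        feature_update(event)
    while True:                     # caught up: switch to real-time updates
        try:
            event = live_queue.get(timeout=0.1)
        except queue.Empty:
            return  # in this sketch, stop once the live feed drains
        feature_update(event)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The point of the sketch is that the same &lt;em&gt;feature_update&lt;/em&gt; code sees historical and live events identically, so catching up to the present requires no separate code path.&lt;/p&gt;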

&lt;p&gt;Our &lt;a href=&quot;/shared/schleier-smith-kdd-2015.pdf&quot;&gt;paper&lt;/a&gt; describes our system in detail, and the &lt;a href=&quot;https://ifwe.github.io/antelope/&quot;&gt;Antelope open source project&lt;/a&gt; includes examples drawn directly from our experience. The dramatic impact on our business is shown by the usage metrics:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/blog/agile-machine-learning/meet-me-dau.png&quot; alt=&quot;Meet Me DAU growth&quot; /&gt;&lt;/p&gt;

&lt;p&gt;During an especially intense 6-month period of iterative product development we released 15 new models, updating experiment parameters 123 times along the way. Results from A/B testing showed a cumulative increase in daily users of Meet Me exceeding 30%.&lt;/p&gt;

&lt;h1 id=&quot;the-agile-advantage&quot;&gt;The Agile advantage&lt;/h1&gt;

&lt;p&gt;Achieving rapid product cycles allows data scientists to stay focused. They can immediately evaluate how users respond to new recommendation algorithms, then use the understanding gained in further feature engineering. Teamwork also improves, as models readily deployable to production are just as readily deployable to a colleague’s development environment.&lt;/p&gt;

&lt;p&gt;For us the contrast could not be more stark: we went from six months spent working on one model implementation, only to achieve unimpressive gains, to releasing a string of improvements over a similar period, each building on the last and cumulatively delivering substantial improvements in both user experience and key business metrics.&lt;/p&gt;

&lt;p&gt;In most machine learning applications, there remains a tremendous gulf between demonstrations of what is possible and the constraints of what is practical in an industry setting. Building systems around a single path to data, with model features derived from an event history API, enables rapid deployment cycles and dramatically accelerates progress.&lt;/p&gt;
</description>
        <pubDate>Sun, 09 Aug 2015 23:00:00 -0700</pubDate>
        <link>https://johann.schleier-smith.com/blog/2015/08/09/need-for-agile-machine-learning.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2015/08/09/need-for-agile-machine-learning.html</guid>
        
        
      </item>
    
      <item>
        <title>Why start a blog in 2015?</title>
        <description>&lt;p&gt;Am I really doing this? That’s the question I ask myself as I work on assembling a redesigned website, now with a weblog (i.e., a blog). We’re well on our way into the 21st century, I abandoned my 1990s-era personal web site near the turn of the millennium, and since then I have never missed having an online presence beyond what social media offers. Current trends are toward increasing screen time on mobile devices, and on mobile devices people are spending more time in native apps and less in the web browser. I have to ask myself, who will still be reading a weblog like this in a few years?&lt;/p&gt;

&lt;p&gt;Today’s hot platform is Medium, which upgrades blogging in many ways, offering elegance, ease of use, distribution, even beauty, all in one simple package (see &lt;a href=&quot;http://www.theatlantic.com/technology/archive/2015/02/what-blogging-has-become/386201/&quot;&gt;What Blogging Has Become&lt;/a&gt;). LinkedIn offers a publishing platform that’s effective for those seeking to reach a professional audience, and there are plenty of other blog-like services, such as Tumblr, offering easy ways to create and share online content. In contrast, a standalone personal website and blog is much more work and provides fewer social features.&lt;/p&gt;

&lt;p&gt;Still, I’m launching this blog, even though it’s 2015, because of benefits that come from the web’s &lt;em&gt;open platform&lt;/em&gt;. Despite flashy alternatives, despite the popularity of social media platforms, the staying power and strength of blogs is remarkable. Traditional news formats remain under continued assault from web publishing, which ascends relentlessly. WordPress, the top blog platform, is a strong and profitable business. It sits within a field of active competitors, SquareSpace, Weebly, and others among them. Where some platforms have fallen (e.g., LiveJournal), others have risen to take their place. And there’s healthy innovation—I’m using &lt;a href=&quot;http://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt;, a static site generator which appeals to developers. It is clear that the web’s blogging ecosystem remains healthy, in a way that only an open platform makes possible.&lt;/p&gt;

&lt;p&gt;In addition to staying power, flexibility and control are two key reasons I’m fully embracing the open web’s open platform, why I’m starting a traditional blog rather than establishing my presence on Medium or something similar. Flexibility means my blog can be just about anything, just so long as it adheres to the web’s standards: HTML, JavaScript, HTTP, DNS, etc. Control means I get to choose how to take advantage of this flexibility, I’m not constrained by the choices of a particular vendor, nor am I coerced by them.&lt;/p&gt;

&lt;p&gt;I’m paying for these benefits with effort, and notably, lack of social features. This highlights what I believe is an important missing piece among the internet’s open standards: identity. It would make so much more sense if, after signing into Chrome or Firefox, I didn’t merely get access to stored history and bookmarks, but were also automatically signed in when visiting any service that I regularly use, and one permission tap away from signing up for anything new. Identity could also provide a building block for open standards for notifications and social features. Perhaps this is a topic that I will pursue in a future post.&lt;/p&gt;

&lt;p&gt;The debate on the merits of open platforms versus proprietary ones is an old one. I’m not ready to argue that openness must win in the end, as some do, but in starting a traditional weblog I’m betting on the staying power of openness. I’m doing a little extra work, but I’m having fun with the technical challenges, and I hold out hope for a future open internet enriched by identity and social features built atop it.&lt;/p&gt;
</description>
        <pubDate>Fri, 26 Jun 2015 00:00:00 -0700</pubDate>
        <link>https://johann.schleier-smith.com/blog/2015/06/26/why-start-a-blog-in-2015.html</link>
        <guid isPermaLink="true">https://johann.schleier-smith.com/blog/2015/06/26/why-start-a-blog-in-2015.html</guid>
        
        
      </item>
    
  </channel>
</rss>
