Are Incumbents Accruing All The AI Value?

GitHub announced GitHub Copilot X, a suite of proposed features that integrates AI across the entire GitHub product: ChatGPT-like experiences in your IDE, Copilot for Pull Requests, and AI-generated answers about documentation.

Any startups that were thinking about building these features might be second-guessing themselves now. GitHub and VSCode power a large surface of the developer workflow – will startups be able to add value here, or will GitHub and Microsoft accrue all the value?

The devil's advocate argument for why GitHub won't execute:

  • "GitHub Copilot X is currently a representation of GitHub’s vision for the future rather than an available product offering of GitHub Copilot." The announcement is simply that, an announcement. It's hard to ship. Especially for an acquired, decades-old company. It's hard to change the culture (even with paradigm shifts).
  • The changes are ideas bolted onto existing features, not net-new workflows. Why even have a pull request description or a commit message if they're autogenerated? Why not generate them on demand? The pull request workflow is not the end state of developer workflows. I've written about What Comes After Git and many other improvements that could be made at the version control or SaaS level.
  • What if these features are net-negative experiences? Bad suggestions, GitHub becoming a verbose bag of text, changes that are hard to roll back.
  • Misses at this scale can prevent a company from competing later on. Remember Google Code? These features are easy to integrate but also easy to screw up.

Of course, you’re up against the best developer (and enterprise) distribution pipeline in the world — VSCode + GitHub + MSFT. So maybe nothing else matters.

Model Arbitrage

Large language models are especially good at generating new examples. This is used for everything from generating unit tests to generating few-shot examples. But what if you started to move past few-shot to full synthetic data sets? You get model arbitrage.

Alpaca 7B was trained for less than $600. It used OpenAI's model to expand a set of 175 human-written instruction/output pairs into more than 52,000 instruction-following examples to train on. Alpaca is fine-tuned from LLaMA (from Meta), so the from-scratch cost isn't exactly $600, but the effective cost is orders of magnitude smaller when building on open-source models.
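
As a rough illustration of that expansion step, here's a minimal sketch in the spirit of Alpaca's self-instruct-style pipeline. It assumes the pre-1.0 openai Python SDK and a hypothetical seed_tasks.jsonl file of human-written pairs; the real pipeline batches requests and filters generations for diversity and quality.

```python
import json

import openai  # assumes the pre-1.0 openai SDK

openai.api_key = "sk-..."  # your API key

# Hypothetical seed file: one {"instruction": ..., "output": ...} JSON object per line.
with open("seed_tasks.jsonl") as f:
    seed_tasks = [json.loads(line) for line in f]

PROMPT_TEMPLATE = """You are generating diverse instruction-following examples.

{examples}

Write one new, different instruction and its output in the same format."""


def format_examples(tasks):
    return "\n\n".join(
        f"Instruction: {t['instruction']}\nOutput: {t['output']}" for t in tasks
    )


synthetic = []
for _ in range(10):  # the actual Alpaca run produced ~52k examples
    prompt = PROMPT_TEMPLATE.format(examples=format_examples(seed_tasks[:3]))
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,
    )
    synthetic.append(resp["choices"][0]["text"].strip())

# `synthetic` then becomes fine-tuning data for the smaller open-source model.
```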

Sure, this is against OpenAI's terms and conditions (which is why Alpaca is "non-commercial"), but as more models become open-source, can you really stop this sort of model arbitrage?

Arbitrage will make the reasoning ability of foundational models converge. Any model that outperforms will simply be used to generate training data for others.

https://openai.com/policies/terms-of-use

Modeling Context Length vs. Information Retrieval Cost in LLMs

Large language models are unique because you can get good results with in-context learning (i.e., prompting) at inference time. This is much cheaper and more flexible than fine-tuning a model.

But what happens when you have too much data to fit in a prompt but don't want to fine-tune? How do you provide the proper context to the model?

You have a few choices:

  • Use the model with the largest context window. For example, most models have a limit of 4k tokens (prompt and completion included), but GPT-4 has a variant with a 32k-token window.
  • Use a vector database to perform a similarity search to filter down the relevant context for the model. Only a subset (e.g., the three most similar sentences or paragraphs) is included in the prompt (a sketch follows this list).
  • Use a traditional search engine (e.g., ElasticSearch, Bing) to retrieve information. Unlike similarity search, there’s more semantic work to be done here (but possibly more relevant results).
  • Use an alternative architecture where the model does some routing to more specific models or performs information retrieval itself (e.g., Google’s Pathways architecture).
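
To make the second option concrete, here's a minimal sketch of similarity-search-augmented prompting. It assumes the pre-1.0 openai SDK and uses a plain NumPy index in place of a managed vector database; `documents` is a placeholder corpus.

```python
import numpy as np
import openai  # assumes the pre-1.0 openai SDK

EMBED_MODEL = "text-embedding-ada-002"  # 1536-dimensional embeddings


def embed(texts):
    # Returns one embedding vector per input string.
    resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return np.array([d["embedding"] for d in resp["data"]])


# Placeholder corpus; in practice these vectors would live in a vector database.
documents = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
doc_vectors = embed(documents)


def top_k_context(question, k=3):
    q = embed([question])[0]
    # Cosine similarity between the question and every chunk.
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:k]
    return "\n\n".join(documents[i] for i in best)


question = "..."
prompt = (
    "Answer using only the context below.\n\n"
    f"{top_k_context(question)}\n\n"
    f"Q: {question}\nA:"
)
answer = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=200)
```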

What will be the dominant architecture in the future? Here's a napkin-math look at the cost of the different methods. It's a bit of an apples-and-oranges comparison (there are use cases that only work with a specific method), but this just looks at the use case of augmenting in-context learning with the relevant data.

(Let’s assume 1 page ~= 500 words, and 1 sentence ~= 15 words, 1 word ~= 5 characters).

Using the largest model. With large context lengths, let’s estimate there’s a 9:1 split between prompt tokens (currently $0.06/1k tokens) and sampled tokens ($0.12/1k tokens). This comes out to a blended $0.066 / 1k tokens.

Going by OpenAI's rule of thumb for English text, 1 token ~= 4 characters, or 100 tokens ~= 75 words.

At the full 32k-token capacity, that's $2.112 per query (32,000 tokens at the blended rate), covering roughly 24,000 words.
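
The arithmetic behind those two numbers, as a quick sanity check:

```python
# Blended $/1k tokens for a 9:1 prompt:completion split at the quoted GPT-4-32k prices.
prompt_rate, completion_rate = 0.06, 0.12            # $ per 1k tokens
blended = 0.9 * prompt_rate + 0.1 * completion_rate  # = $0.066 per 1k tokens

context_tokens = 32_000
cost_per_query = context_tokens / 1000 * blended     # = $2.112 per full query
words_per_query = context_tokens * 0.75              # ~= 24,000 words
```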

Vector search. You can convert chunks of text to vectors; for simplicity, let's say one sentence per vector. In practice, chunks might be larger (paragraphs) or smaller (single tokens).

Vector sizes. Let's use 1536 dimensions since that's the size of OpenAI's embeddings. In practice, you would probably store a lower dimensionality in a vector database (768 or even 256). Pinecone, a vector database, has a standard tier that roughly fits up to 5mm 768-dimensional vectors, costing $0.096/hour or ~$70/mo, compute included. Let's assume this equates to 2.5mm 1536-dimensional vectors.

A rough calculation of the storage size required for 2.5mm 1536-dimensional vectors (assuming float32):

2.5mm vectors * 1536 dimensions * 4 bytes per dimension ~= 15GB

At roughly one token per vector, that's about 1.875mm words, significantly more than even the largest context window. At ~$70/mo and 100 queries/day, that's $0.023 per query.
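
The same napkin math for the vector-database side, under the storage and pricing assumptions above:

```python
# Storage for 2.5mm OpenAI-sized vectors (1536 dimensions, float32).
vectors = 2_500_000
storage_gb = vectors * 1536 * 4 / 1e9    # ~= 15.4 GB

# Capacity in words, at roughly one token (0.75 words) per vector.
capacity_words = vectors * 0.75          # ~= 1.875mm words

# Pinecone standard tier from above: ~$70/mo; assume 100 queries/day.
monthly_cost, monthly_queries = 70, 100 * 30
cost_per_query = monthly_cost / monthly_queries  # ~= $0.023 per query
```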

Of course, you still need to put the relevant documents in the prompt.

Essentially, as long as the similarity search trims the prompt by more than ~1% of its tokens ($0.023/$2.112), you should run the vector search first. Today, that's a no-brainer.

The numerator ($/vector search) and the denominator ($/token) are both likely to decrease over time, but the $/token costs are likely to fall much faster than the vector database costs. If token costs fall 10x faster, the breakeven rises to ~10% of the prompt. Maybe a different story.
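
And the breakeven fraction, plus how it shifts if token prices fall 10x faster than vector-database prices:

```python
vector_query_cost = 0.023   # $ per vector search (from above)
full_prompt_cost = 2.112    # $ per 32k-token prompt (from above)

breakeven = vector_query_cost / full_prompt_cost    # ~= 1.1% of the prompt
# If token costs drop 10x while vector costs hold, the bar rises to ~11%.
breakeven_after_10x = vector_query_cost / (full_prompt_cost / 10)
```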

Additional costs: maintaining the vector database infrastructure, the added latency of making a database call first (and who knows how slow the 32k-prompt models will be, which is a different calculation), the margin of error (which approach finds the relevant information more often?), and the developer experience (a data pipeline vs. "put it all in the prompt").