Talking to the Practitioners
How are startups actually leveraging LLMs? And a week to remember
As the saying goes, there are decades where nothing happens, and there are weeks where decades happen. And as Aaron Levie insinuated above, it definitely feels like we are living in one of “those weeks”. In the past few days alone we’ve witnessed:
OpenAI launching the highly anticipated GPT-4
Google introducing AI for Workspace and opening API access to its own PaLM model
Microsoft announcing 365 Copilot, incorporating LLMs into Word, PowerPoint, Excel, Outlook, etc.
Before diving into the meat of today's post, which is how startups are leveraging foundation models, let’s quickly recap these major announcements.
It’s becoming more and more apparent that OpenAI is emerging as the heavyweight bellwether for AI. On Tuesday, the company announced GPT-4, its most advanced system yet. Oren Etzioni, of the Allen Institute for AI and a Venture Partner at Madrona, describes how “GPT-4 is now the standard by which all foundation models will be evaluated”.
So how exactly does it compare to OpenAI’s prior models, particularly its predecessor GPT-3.5?
Advanced reasoning → GPT-4 can better parse complex queries and prompts, delivering complete answers prior models could not
Greater accuracy → Trained on a broader set of general knowledge data, GPT-4 is more creative and capable of solving difficult problems with a higher degree of accuracy than GPT-3.5
Safer and more aligned → GPT-4 incorporates more human feedback and applies lessons from real-world uses to improve safety and lessen (but not completely eliminate) hallucinations
One of the most stunning examples of GPT-4’s advanced capabilities relative to prior models is how it scores on standardized exams (for example, reaching the 90th percentile on the bar exam, in contrast to ChatGPT’s 10th percentile, as shown below).
However, in some areas, GPT-4 didn’t quite live up to the hype, most notably in multi-modality. The model understands images as input and can reason with them in sophisticated ways, but does not generate new images as output the way that DALL-E does. GPT-4 also does not appear to interpret audio or video. And the model continues to be a black box; the MIT Technology Review describes how “GPT-4 is the most secretive release the company has ever put out.” The Proprietary vs. Open-Source model debate rages on.
In any event, it’s fun to see one of our key AI/ML predictions for 2023 play out in real time. For those interested, you can read more about OpenAI’s GPT-4 research here. There has also been great coverage by Ben Thompson and the Algorithmic Bridge.
The previously staid Microsoft continues to deliver amazing products through its partnership with OpenAI. Just weeks after releasing AI-powered Bing, Microsoft announced 365 Copilot, bringing the magic of LLMs to its suite of applications including Word, PowerPoint, Excel, Outlook, and others. With nearly 350M paid seats in 365 and more than 1B total Office users, this is perhaps the largest rollout of AI for business users in history.
Copilot will be integrated in two key ways:
As an Embedded Assistant → Copilot is natively embedded into each application, so users can input prompts without leaving their workflow. For example, in Word, users can input a command like “create a one-page draft based on this outline”, or in PowerPoint, “build a ten-slide presentation on the stock market”.
Business Chat → This is a new modality where 365 users can interface with the Business Chat application, which harnesses the power of the Microsoft Graph to bring together information across multiple sources and dimensions. Users can issue natural language prompts like “tell my team how we updated the product strategy” and “summarize the chats, emails, and documents from last night’s customer escalation.”
It’s worth noting that Microsoft has deftly recognized the power of the “Copilot” branding, given that it is already seeing nearly half of all GitHub users coding with Copilot (and that number is likely only going to increase). It is also worth noting that the announcements from both OpenAI and Microsoft appear to have taken the wind out of Google’s sails. It is perhaps (way) too early to ask the question, but is Google going the way of Kodak?
Let’s get down to business.
As VCs, we tend to pontificate on how companies can use foundation models and integrate the power of AI. However, we wanted to dig into how various startups are actually building on foundation models in practice. To do so, we surveyed six companies building intelligent applications, including four “Generative Native” apps:
Yoodli → An AI-powered speech coach
Compose AI → A Chrome plugin for automated writing
Blue Willow → An image generator for landscapes, characters, and digital artwork
Speak → An advanced AI language tutor
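And two “Generative Enhanced” apps:
Seekout → A talent acquisition and talent management platform
Coda → An all-in-one document canvas and work assistant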
Below we dive into the questions and summarized responses (note: we sent out this survey prior to the general availability of GPT-4):
How are you leveraging FMs/LLMs into your applications today?
Yoodli uses LLMs to provide users with AI-powered speech coaching (in the form of text-based feedback). Yoodli provides users a summary of their speech (as an audience would understand it) and suggestions on conciseness, paraphrasing, and follow up questions to prepare for.
Compose AI generates text, rephrases text, autocompletes sentences, and replies to messages/email.
Blue Willow develops and serves text-to-image models for artists, designers, as well as for the general audience.
Speak uses LLMs to engage language learners in open-ended conversations and provide contextual feedback to them on how to sound more native.
Seekout uses entity extraction, text summarization, code summarization, content creation, semantic search, embeddings, and code generation to assist with talent acquisition and talent management.
Coda is building the world’s best work assistant, which can help you write, edit, and fill in tables of information intelligently. It can summarize your meetings, extract action items, or generate categories, data, and content for your powerful automated workflows.
How many models are you using today?
All the companies we surveyed are using more than one model. Interestingly, most of them are using at least one OpenAI GPT-3-based model, such as Ada, Curie, Davinci, or GPT-3.5 Turbo. While OpenAI models seem to be the most popular choice currently, some companies mentioned that they are considering evaluating other models, such as those from AI21 and Co:here.
In talking to these companies, we found that model evaluation comes down to cost, performance, and accuracy. As models improve and become more fine-tuned or vertical-specific, we predict companies may use on the order of a dozen models or more, leaving an opportunity for middleware companies to emerge.
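To make the cost, performance, and accuracy trade-off concrete, here is a minimal sketch of the kind of routing logic a middleware layer might perform. It is not taken from any surveyed company: the model names, prices, latencies, and accuracy scores are all invented for illustration, and "performance" is assumed here to mean latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model measurements; every number below is made up."""
    name: str
    cost_per_1k_tokens: float  # USD, illustrative only
    p50_latency_ms: float      # median latency, illustrative only
    accuracy: float            # 0..1 score on your own evaluation set


def pick_model(candidates, min_accuracy, max_latency_ms):
    """Return the cheapest model that meets the accuracy and latency bars."""
    eligible = [
        m for m in candidates
        if m.accuracy >= min_accuracy and m.p50_latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise ValueError("no model satisfies the requested constraints")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


# An invented catalog standing in for the several models a company might use.
CATALOG = [
    ModelProfile("small-fast", 0.0004, 150, 0.71),
    ModelProfile("mid-tier", 0.002, 400, 0.84),
    ModelProfile("large-accurate", 0.02, 1200, 0.93),
]

# Autocomplete tolerates lower accuracy but needs a quick response;
# summarization can wait longer but needs a more accurate model.
autocomplete_model = pick_model(CATALOG, min_accuracy=0.70, max_latency_ms=200)
summary_model = pick_model(CATALOG, min_accuracy=0.90, max_latency_ms=2000)
print(autocomplete_model.name, summary_model.name)
```

The point of the sketch is that different features inside one product can land on different models, which is one reason a multi-model setup may become the norm.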
What guardrails are you putting in place today or in the future to ensure accuracy and maintain user trust?
Generative AI models can hallucinate and produce incorrect outputs. Building user trust is a key part of a durable moat; if models hallucinate, users may lose faith in the company. All of the companies we surveyed have implemented multiple guardrails to reduce the likelihood of such incidents. Below are the key measures they use to limit hallucinations:
Prompt Control: Leverage prompting techniques to reduce hallucinations
Model Fine-tuning & Post-processing: Improving model accuracy and safety through user feedback
Filtering: Content and bad word filtering
Human-in-the-Loop: HITL feedback to ensure accuracy and relevancy
Operating in Lower Stakes Environments: Guiding users towards tasks where generative AI has been shown to be reasonably good and carries a lower risk of inaccuracy
“We have safety mechanisms in place to check agent and user turns to end inappropriate conversation [and] limit users’ access to prompts and manipulating model instructions.” - Speak
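As a rough illustration of how a few of these guardrails could be layered, the sketch below chains prompt control, a blocklist filter, and a human-in-the-loop review queue. It is an assumption-laden toy, not any surveyed company's implementation: `generate` is a hypothetical callable returning a (text, confidence) pair, the blocklist terms are placeholders, and the 0.6 confidence threshold is arbitrary.

```python
import re

BLOCKLIST = {"badword", "slur"}  # placeholder terms, not a real list

SYSTEM_PROMPT = (
    "You are a writing assistant. Answer only from the provided context. "
    "If the context does not contain the answer, reply exactly: I don't know."
)


def build_prompt(context, question):
    """Prompt control: pin the model to the supplied context."""
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"


def contains_blocked_term(text):
    """Filtering: whole-word, case-insensitive blocklist check."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & BLOCKLIST)


def guarded_reply(context, question, generate, review_queue, min_confidence=0.6):
    """Run one request through the guardrails.

    `generate` is a hypothetical callable: prompt -> (text, confidence).
    """
    if contains_blocked_term(question):
        return "Sorry, I can't help with that."
    text, confidence = generate(build_prompt(context, question))
    if contains_blocked_term(text):
        return "Sorry, I can't help with that."
    if confidence < min_confidence:
        # Human-in-the-loop: hold low-confidence answers for review.
        review_queue.append((question, text))
        return "This answer needs review before we can share it."
    return text


def fake_generate(prompt):
    """Stand-in for a real model call; always returns the same answer."""
    return ("Paris is the capital of France.", 0.9)


queue = []
print(guarded_reply("France's capital is Paris.",
                    "What is the capital of France?", fake_generate, queue))
```

In practice each layer would be far more sophisticated, but the ordering (check the input, constrain the prompt, check the output, escalate uncertain cases) is a reasonable starting shape.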
What have FMs/LLMs allowed you to do today that you were previously not able to do?
The current generation of FMs/LLMs is a giant leap forward relative to previous generations of models. For many companies, every new use case or output type would previously have required a separate model, but now companies can use a single model for zero-shot generation across diverse tasks. Blue Willow describes how these diffusion-based text-to-image models can generate images with superb quality and accuracy, which was not possible previously. Seekout also describes how these LLMs allow for reasoning over natural language and unstructured data, which wasn't previously possible at high enough quality.
“Previously Coda used machine learning primarily for document search and ranking recommended templates to users. Large scale generative models have unlocked the ability to bring intelligence into the all-in-one document canvas as a magical building block. We’re taking a horizontal approach, with LLMs able to assist with a huge breadth of use cases, from writing and editing, to classifying rows in tables, or summarizing tables of data. The fact that these models work so well across such a huge range of tasks has been particularly exciting for us. In the past we could have built a specialized summarization model, or a specialized classifier. But today, large language models enable a wide gamut of scenarios easily, including ones we never would have prioritized or even thought of.” - Coda
“100X-ed our speed of development. We can prototype within days (if not hours) features that could have taken us months. Things like summarization have been doable for years, but weren’t worth the time and resourcing for us so early on. Rephrasing speech might have been possible in the pre LLM era. However, it's now much easier (and less expensive) to implement.” - Yoodli
What are the biggest challenges you face in leveraging FMs/LLMs to build generative AI features and products?
As great as FMs are, we heard several recurring challenges:
The pace of technological change: The “state of the art” is changing quickly, making it difficult to keep up with new developments
Talent: There remains a lack of great AI/ML talent to execute new ideas fast
Cost: No surprise, but the cost of GPU-based model serving is high, particularly at scale
Use case-specific challenges: These range from latency to implementation to UX/UI
“Top talent is expensive and hard to hire. There is a lot of low hanging fruit in slightly more R&D space but can be difficult to prioritize as a startup.” - Compose
“The market is moving so quickly. Building on the right foundation is really important and at times tricky to predict. New capabilities, model providers, and skills unlock tons of opportunity, but also create lots of questions for application developers on where to invest now versus letting the platforms get there.” - Coda
“Our primary use case is conversation, so latency is an ongoing challenge when trying to create as natural a speaking experience as possible.” - Speak
What in Foundation Models / Generative AI are you most excited about outside of your own company?
Given the rapid pace of innovation, it’s hard to accurately predict what the “next thing” in AI is going to be. However, practitioners in the space were most excited about GPT-4 (meaning this week was like Christmas coming early), the promise of multi-modality (particularly audio and video), and general-purpose LLMs applied to a wide swath of everyday tasks.
“We are excited about all the new application possibilities opened by generative AI, especially how people can interact with these AI models to become more productive and creative.” - Blue Willow
“We’re really excited about high-quality embeddings becoming cheap and fast to produce, unlocking a plethora of powerful and scalable search and ranking opportunities.” - Coda
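For readers less familiar with how embeddings support search and ranking, here is a minimal sketch using cosine similarity. The three-dimensional vectors and document names are invented; real embeddings would come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most similar to the query first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model-produced embeddings.
DOCS = {
    "meeting-notes": [0.9, 0.1, 0.0],
    "hiring-plan": [0.1, 0.8, 0.3],
    "product-roadmap": [0.7, 0.2, 0.6],
}
QUERY = [1.0, 0.0, 0.1]

for doc_id, score in rank_documents(QUERY, DOCS):
    print(f"{doc_id}: {score:.3f}")
```

Cheaper and faster embeddings matter because the ranking step itself is simple; the expensive part is producing a vector for every document and query.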
With the launch of GPT-4, AI for Workspace, and 365 Copilot, advanced artificial intelligence and large language models are rapidly making their way into applications used by everyday users. It’s thrilling to see how practitioners, particularly founders of early-stage startups, are building on top of foundation models and incorporating AI.
Some observations from builders in this space:
Using multiple models, not just a single model, will be the standard
Hallucinations are going to happen, and it’s important to build in guardrails to maintain user trust (e.g. prompt control, filtering, HITL)
Foundation models are a complete leap forward relative to previous generations of models, and allow for novel use cases and applications not previously possible
At the same time, there are still challenges to maximizing utility of FMs, including a lack of available talent, cost at scale, and simply keeping up with the pace of change
Nobody can really predict what’s coming next :)
There’s no question that March 2023 had one of those “weeks where decades happen” in the context of AI innovation. We can’t wait to see where things go from here.
Below we highlight select private funding announcements across the Intelligent Applications sector. These deals include private Intelligent Application companies that have raised in the last two weeks, are HQ’d in the U.S. or Canada, and have raised a Seed through Series E round.
New Deal Announcements - 03/03/2023 - 03/16/2023:
Congratulations to Madrona portfolio company OthersideAI on their $2.8M second round of funding. OthersideAI’s AI writing assistant, HyperWrite, helps you write anything on the web faster and easier with personalized and context-aware AI.
Special thanks to Yoodli (Varun Puri and Esha Joshi), Compose AI (Michael Shuffett), Blue Willow (Ritankar Das), Speak (Connor Zwick & Colton Gyulay), Coda (David Kossnick), and Seekout (Aravind Bala) for their thoughts and inputs!
We hope you enjoyed this edition of Aspiring for Intelligence, and we will see you again in two weeks! This is a quickly evolving category, and we welcome any and all feedback around the viewpoints and theses expressed in this newsletter (as well as what you would like us to cover in future writeups). And it goes without saying, but if you are building the next great intelligent application and want to chat, drop us a line!