Our Security section has grown almost as large as AI (and longer than Programming)—and that’s not including some security issues specific to AI, like model leeching. Does that mean that AI is cooling down? Or that security is heating up? It’s really impossible for security issues to get too much attention. The biggest news in AI arrived on the last day of October, and it wasn’t technical at all: the Biden administration’s executive order on AI. It will take some time to digest this, and even longer to see whether vendors follow the order’s recommendations. In itself, it’s evidence of an important ongoing trend: in the next year, many of the most important developments in AI will be legal rather than technical.
Artificial IntelligenceThis time of year, everyone publishes predictions. They’re fun, but I don’t find them a good source of insight into what’s happening in technology.
Instead of predictions, I’d prefer to look at questions: What are the questions to which I’d like answers as 2023 draws to a close? What are the unknowns that will shape 2024? That’s what I’d really like to know. Yes, I could flip a coin or two and turn these into predictions, but I’d rather leave them open-ended. Questions don’t give us the security of an answer. They force us to think, and to continue thinking. And they let us pose problems that we really can’t think about if we limit ourselves to predictions like “While individual users are getting bored with ChatGPT, enterprise use of Generative AI will continue to grow.” (Which, as predictions go, is pretty good.)
The Lawyers Are ComingThe year of tech regulation: Outside of the EU, we may be underwhelmed by the amount of proposed regulation that becomes law. However, discussion of regulation will be a major pastime of the chattering classes, and major technology companies (and venture capital firms) will be maneuvering to ensure that regulation benefits them. Regulation is a double-edged sword: while it may limit what you can do, if compliance is difficult, it gives established companies an advantage over smaller competition.
Three specific areas need watching:
Organized labor: Unions are back. How will this affect technology? I doubt that we’ll see strikes at major technology companies like Google and Amazon—but we’ve already seen a union at Bandcamp. Could this become a trend? X (Twitter) employees have plenty to be unhappy about, though many of them have immigration complications that would make unionization difficult.
The backlash against the backlash against open source: Over the past decade, a number of corporate software projects have changed from an open source license, such as Apache, to one of a number of “business source” licenses. These licenses vary, but typically restrict users from competing with the project’s vendor. When HashiCorp relicensed their widely used Terraform product as business source, their community’s reaction was strong and immediate. They formed an OpenTF consortion and forked the last open source version of Terraform, renaming it OpenTofu; OpenTofu was quickly adopted under the Linux Foundation’s mantle and appears to have significant traction among developers. In response, HashiCorp’s CEO has predicted that the rejection of business source licenses will be the end of open source.
A decade ago, we said that open source has won. More recently, developers questioned open source’s relevance in an era of web giants. In 2023, the struggle resumed. By the end of 2024, we’ll know a lot more about the answers to these questions.
Simpler, PleaseKubernetes: Everyone (well, almost everyone) is using Kubernetes to orchestrate large applications that are running in the cloud. And everyone (well, almost everyone) thinks Kubernetes is too complex. That’s no doubt true; prior to its release as an open source project, Kubernetes was Google’s Borg, the almost legendary software that ran their core applications. Kubernetes was designed for Google-scale deployments, but very few organizations need that.
We’ve long thought that a simpler alternative to Kubernetes would arrive. We haven’t seen it. We have seen some simplifications built on top of Kubernetes: K3s is one; Harpoon is a no-code drag-and-drop tool for managing Kubernetes. And all the major cloud providers offer “managed Kubernetes” services that take care of Kubernetes for you.
So our questions about container orchestration are:
From microservices to monolith: While microservices have dominated the discussion of software architecture, there have always been other voices arguing that microservices are too complex, and that monolithic applications are the way to go. Those voices are becoming more vocal. We’ve heard lots about organizations decomposing their monoliths to build collections of microservices—but in the past year we’ve heard more about organizations going the other way. So we need to ask:
AI systems are not secure: Large language models are vulnerable to new attacks like prompt injection, in which adversarial input directs the model to ignore its instructions and produce hostile output. Multimodal models share this vulnerability: it’s possible to submit an image with an invisible prompt to ChatGPT and corrupt its behavior. There is no known solution to this problem; there may never be one.
With that in mind, we have to ask:
The metaverse: It isn’t dead, but it’s not what Zuckerberg or Tim Cook thought. We’ll discover that the metaverse isn’t about wearing goggles, and it certainly isn’t about walled-off gardens. It’s about better tools for collaboration and presence. While this isn’t a big trend, we’ve seen an upswing in developers working with CRDTs and other tools for decentralized frictionless collaboration.
NFTs: NFTs are a solution looking for a problem. Enabling people with money to prove they can spend their money on bad art wasn’t a problem many people wanted to solve. But there are problems out there that they could solve, such as maintaining public records in an open immutable database. Will NFTs actually be used to solve any of these problems?
Disclaimer: Based on the announcement of the EO, without having seen the full text.
Overall, the Executive Order is a great piece of work, displaying a great deal of both expertise and thoughtfulness. It balances optimism about the potential of AI with reasonable consideration of the risks. And it doesn’t rush headlong into new regulations or the creation of new agencies, but instead directs existing agencies and organizations to understand and apply AI to their mission and areas of oversight. The EO also does an impressive job of highlighting the need to bring more AI talent into government. That’s a huge win.
Given my own research focus on enhanced disclosures as the starting point for better AI regulation, I was heartened to hear that the Executive Order on AI uses the Defense Production Act to compel disclosure of various data from the development of large AI models. Unfortunately, these disclosures do not go far enough. The EO seems to be requiring only data on the procedures and results of “Red Teaming” (i.e. adversarial testing to determine a model’s flaws and weak points), and not a wider range of information that would help to address many of the other concerns outlined in the EO. These include:
These are only a few off-the-cuff suggestions. Ideally, once a full range of required disclosures has been identified, they should be overseen by either an existing governmental standards body, or a non-profit akin to the Financial Accounting Standards Board (FASB) that oversees accounting standards. This is a rapidly-evolving field, and so disclosure is not going to be a “once-and-done” kind of activity. We are still in the early stages of the AI era, and innovation should be allowed to flourish. But this places an even greater emphasis on the need for transparency, and the establishment of baseline reporting frameworks that will allow regulators, investors, and the public to measure how successfully AI developers are managing the risks, and whether AI systems are getting better or worse over time.
UpdateAfter reading the details found in the full Executive Order on AI, rather than just the White House summary, I am far less positive about the impact of this order, and what appeared to be the first steps towards a robust disclosure regime, which is a necessary precursor to effective regulation. The EO will have no impact on the operations of current AI services like ChatGPT, Bard, and others under current development, since its requirements that model developers disclose the results of their “red teaming” of model behaviors and risks only apply to future models trained with orders of magnitude more compute power than any current model. In short, the AI companies have convinced the Biden Administration that the only risks worth regulating are the science-fiction existential risks of far future AI rather than the clear and present risks in current models.
It is true that various agencies have been tasked with considering present risks such as discrimination in hiring, criminal justice applications, and housing, as well as impacts on the job market, healthcare, education, and competition in the AI market, but those efforts are in their infancy and years off. The most important effects of the EO, in the end, turn out to be the call to increase hiring of AI talent into those agencies, and to increase their capabilities to deal with the issues raised by AI. Those effects may be quite significant over the long run, but they will have little short-term impact.
In short, the big AI companies have hit a home run in heading off any effective regulation for some years to come.
Ever since the current craze for AI-generated everything took hold, I’ve wondered: what will happen when the world is so full of AI-generated stuff (text, software, pictures, music) that our training sets for AI are dominated by content created by AI. We already see hints of that on GitHub: in February 2023, GitHub said that 46% of all the code checked in was written by Copilot. That’s good for the business, but what does that mean for future generations of Copilot? At some point in the near future, new models will be trained on code that they have written. The same is true for every other generative AI application: DALL-E 4 will be trained on data that includes images generated by DALL-E 3, Stable Diffusion, Midjourney, and others; GPT-5 will be trained on a set of texts that includes text generated by GPT-4; and so on. This is unavoidable. What does this mean for the quality of the output they generate? Will that quality improve or will it suffer?
I’m not the only person wondering about this. At least one research group has experimented with training a generative model on content generated by generative AI, and has found that the output, over successive generations, was more tightly constrained, and less likely to be original or unique. Generative AI output became more like itself over time, with less variation. They reported their results in “The Curse of Recursion,” a paper that’s well worth reading. (Andrew Ng’s newsletter has an excellent summary of this result.)
I don’t have the resources to recursively train large models, but I thought of a simple experiment that might be analogous. What would happen if you took a list of numbers, computed their mean and standard deviation, used those to generate a new list, and did that repeatedly? This experiment only requires simple statistics—no AI.
Although it doesn’t use AI, this experiment might still demonstrate how a model could collapse when trained on data it produced. In many respects, a generative model is a correlation engine. Given a prompt, it generates the word most likely to come next, then the word mostly to come after that, and so on. If the words “To be” pop out, the next word is reasonably likely to be “or”; the next word after that is even more likely to be “not”; and so on. The model’s predictions are, more or less, correlations: what word is most strongly correlated with what came before? If we train a new AI on its output, and repeat the process, what is the result? Do we end up with more variation, or less?
To answer these questions, I wrote a Python program that generated a long list of random numbers (1,000 elements) according to the Gaussian distribution with mean 0 and standard deviation 1. I took the mean and standard deviation of that list, and use those to generate another list of random numbers. I iterated 1,000 times, then recorded the final mean and standard deviation. This result was suggestive—the standard deviation of the final vector was almost always much smaller than the initial value of 1. But it varied widely, so I decided to perform the experiment (1,000 iterations) 1,000 times, and average the final standard deviation from each experiment. (1,000 experiments is overkill; 100 or even 10 will show similar results.)
When I did this, the standard deviation of the list gravitated (I won’t say “converged”) to roughly 0.45; although it still varied, it was almost always between 0.4 and 0.5. (I also computed the standard deviation of the standard deviations, though this wasn’t as interesting or suggestive.) This result was remarkable; my intuition told me that the standard deviation wouldn’t collapse. I expected it to stay close to 1, and the experiment would serve no purpose other than exercising my laptop’s fan. But with this initial result in hand, I couldn’t help going further. I increased the number of iterations again and again. As the number of iterations increased, the standard deviation of the final list got smaller and smaller, dropping to .0004 at 10,000 iterations.
I think I know why. (It’s very likely that a real statistician would look at this problem and say “It’s an obvious consequence of the law of large numbers.”) If you look at the standard deviations one iteration at a time, there’s a lot a variance. We generate the first list with a standard deviation of one, but when computing the standard deviation of that data, we’re likely to get a standard deviation of 1.1 or .9 or almost anything else. When you repeat the process many times, the standard deviations less than one, although they aren’t more likely, dominate. They shrink the “tail” of the distribution. When you generate a list of numbers with a standard deviation of 0.9, you’re much less likely to get a list with a standard deviation of 1.1—and more likely to get a standard deviation of 0.8. Once the tail of the distribution starts to disappear, it’s very unlikely to grow back.
What does this mean, if anything?
My experiment shows that if you feed the output of a random process back into its input, standard deviation collapses. This is exactly what the authors of “The Curse of Recursion” described when working directly with generative AI: “the tails of the distribution disappeared,” almost completely. My experiment provides a simplified way of thinking about collapse, and demonstrates that model collapse is something we should expect.
Model collapse presents AI development with a serious problem. On the surface, preventing it is easy: just exclude AI-generated data from training sets. But that’s not possible, at least now because tools for detecting AI-generated content have proven inaccurate. Watermarking might help, although watermarking brings its own set of problems, including whether developers of generative AI will implement it. Difficult as eliminating AI-generated content might be, collecting human-generated content could become an equally significant problem. If AI-generated content displaces human-generated content, quality human-generated content could be hard to find.
If that’s so, then the future of generative AI may be bleak. As the training data becomes ever more dominated by AI-generated output, its ability to surprise and delight will diminish. It will become predictable, dull, boring, and probably no less likely to “hallucinate” than it is now. To be unpredictable, interesting, and creative, we still need ourselves.
Anant Agarwal, an MIT professor and of the founders of the EdX educational platform, recently created a stir by saying that prompt engineering was the most important skill you could learn. And that you could learn the basics in two hours.
Although I agree that designing good prompts for AI is an important skill, Agarwal overstates his case. But before discussing why, it’s important to think about what prompt engineering means.
Attempts to define prompt engineering fall into two categories:
Designing automated prompting systems is clearly important. It gives you much more control over what an AI is likely to do; if you package the information needed to answer a question into the prompt, and tell the AI to limit its response to information included in that package, it’s much less likely to “hallucinate.” But that’s a programming task that isn’t going to be learned in a couple of hours; it typically involves generating embeddings, using a vector database, then generating a chain of prompts that are answered by different systems, combining the answers, and possibly generating more prompts. Could the basics be learned in a couple of hours? Perhaps, if the learner is already an expert programmer, but that’s ambitious—and may require a definition of “basic” that sets a very low bar.
What about the first, interactive definition? It’s worth noting that all prompts are not created equal. Prompts for ChatGPT are essentially free-form text. Free-form text sounds simple, and it is simple at first. However, more detailed prompts can look like essays, and when you take them apart, you realize that they are essentially computer programs. They tell the computer what to do, even though they aren’t written in a formal computer language. Prompts for an image generation AI like Midjourney can include sections that are written in an almost-formal metalanguage that specifies requirements like resolution, aspect ratio, styles, coordinates, and more. It’s not programming as such, but creating a prompt that produces professional-quality output is much more like programming than “a tarsier fighting with a python.”
So, the first thing anyone needs to learn about prompting is that writing really good prompts is more difficult than it seems. Your first experience with ChatGPT is likely to be “Wow, this is amazing,” but unless you get better at telling the AI precisely what you want, your 20th experience is more likely to be “Wow, this is dull.”
Second, I wouldn’t debate the claim that anyone can learn the basics of writing good prompts in a couple of hours. Chain of thought (in which the prompt includes some examples showing how to solve a problem) isn’t difficult to grasp. Neither is including evidence for the AI to use as part of the prompt. Neither are many of the other patterns that create effective prompts. There’s surprisingly little magic here. But it’s important to take a step back and think about what chain of thought requires: you need to tell the AI how to solve your problem, step by step, which means that you first need to know how to solve your problem. You need to have (or create) other examples that the AI can follow. And you need to decide whether the output the AI generates is correct. In short, you need to know a lot about the problem you’re asking the AI to solve.
That’s why many teachers, particularly in the humanities, are excited about generative AI. When used well, it’s engaging and it encourages students to learn more: learning the right questions to ask, doing the hard research to track down facts, thinking through the logic of the AI’s response carefully, deciding whether or not that response makes sense in its context. Students writing prompts for AI need to think carefully about the points they want to make, how they want to make them, and what supporting facts to use. I’ve made a similar argument about the use of AI in programming. AI tools won’t eliminate programming, but they’ll put more stress on higher-level activities: understanding user requirements, understanding software design, understanding the relationship between components of a much larger system, and strategizing about how to solve a problem. (To say nothing of debugging and testing.) If generative AI helps us put to rest the idea that programming is about antisocial people grinding out lines of code, and helps us to realize that it’s really about humans understanding problems and thinking about how to solve them, the programming profession will be in a better place.
I wouldn’t hesitate to advise anyone to spend two hours learning the basics of writing good prompts—or 4 or 8 hours, for that matter. But the real lesson here is that prompting isn’t the most important thing you can learn. To be really good at prompting, you need to develop expertise in what the prompt is about. You need to become more expert in what you’re already doing—whether that’s programming, art, or humanities. You need to be engaged with the subject matter, not the AI. The AI is only a tool: a very good tool that does things that were unimaginable only a few years ago, but still a tool. If you give in to the seduction of thinking that AI is a repository of expertise and wisdom that a human couldn’t possibly obtain, you’ll never be able to use AI productively.
I wrote a PhD dissertation on late 18th and early 19th century English literature. I didn’t get that degree so that a computer could know everything about English Romanticism for me. I got it because I wanted to know. “Wanting to know” is exactly what it will take to write good prompts. In the long run, the will to learn something yourself will be much more important than a couple of hours training in effective prompting patterns. Using AI as a shortcut so that you don’t have to learn is a big step on the road to irrelevance. The “will to learn” is what will keep you and your job relevant in an age of AI.
Ethan and Lilach Mollick’s paper Assigning AI: Seven Approaches for Students with Prompts explores seven ways to use AI in teaching. (While this paper is eminently readable, there is a non-academic version in Ethan Mollick’s Substack.) The article describes seven roles that an AI bot like ChatGPT might play in the education process: Mentor, Tutor, Coach, Student, Teammate, Student, Simulator, and Tool. For each role, it includes a detailed example of a prompt that can be used to implement that role, along with an example of a ChatGPT session using the prompt, risks of using the prompt, guidelines for teachers, instructions for students, and instructions to help teacher build their own prompts.
The Mentor role is particularly important to the work we do at O’Reilly in training people in new technical skills. Programming (like any other skill) isn’t just about learning the syntax and semantics of a programming language; it’s about learning to solve problems effectively. That requires a mentor; Tim O’Reilly has always said that our books should be like “someone wise and experienced looking over your shoulder and making recommendations.” So I decided to give the Mentor prompt a try on some short programs I’ve written. Here’s what I learned–not particularly about programming, but about ChatGPT and automated mentoring. I won’t reproduce the session (it was quite long). And I’ll say this now, and again at the end: what ChatGPT can do right now has limitations, but it will certainly get better, and it will probably get better quickly.
First, Ruby and Prime NumbersI first tried a Ruby program I wrote about 10 years ago: a simple prime number sieve. Perhaps I’m obsessed with primes, but I chose this program because it’s relatively short, and because I haven’t touched it for years, so I was somewhat unfamiliar with how it worked. I started by pasting in the complete prompt from the article (it is long), answering ChatGPT’s preliminary questions about what I wanted to accomplish and my background, and pasting in the Ruby script.
ChatGPT responded with some fairly basic advice about following common Ruby naming conventions and avoiding inline comments (Rubyists used to think that code should be self-documenting. Unfortunately). It also made a point about a puts() method call within the program’s main loop. That’s interesting–the puts() was there for debugging, and I evidently forgot to take it out. It also made a useful point about security: while a prime number sieve raises few security issues, reading command line arguments directly from ARGV rather than using a library for parsing options could leave the program open to attack.
It also gave me a new version of the program with these changes made. Rewriting the program wasn’t appropriate: a mentor should comment and provide advice, but shouldn’t rewrite your work. That should be up to the learner. However, it isn’t a serious problem. Preventing this rewrite is as simple as just adding “Do not rewrite the program” to the prompt.
Second Try: Python and Data in SpreadsheetsMy next experiment was with a short Python program that used the Pandas library to analyze survey data stored in an Excel spreadsheet. This program had a few problems–as we’ll see.
ChatGPT’s Python mentoring didn’t differ much from Ruby: it suggested some stylistic changes, such as using snake-case variable names, using f-strings (I don’t know why I didn’t; they’re one of my favorite features), encapsulating more of the program’s logic in functions, and adding some exception checking to catch possible errors in the Excel input file. It also objected to my use of “No Answer” to fill empty cells. (Pandas normally converts empty cells to NaN, “not a number,” and they’re frustratingly hard to deal with.) Useful feedback, though hardly earthshaking. It would be hard to argue against any of this advice, but at the same time, there’s nothing I would consider particularly insightful. If I were a student, I’d soon get frustrated after two or three programs yielded similar responses.
Of course, if my Python really was that good, maybe I only needed a few cursory comments about programming style–but my program wasn’t that good. So I decided to push ChatGPT a little harder. First, I told it that I suspected the program could be simplified by using the dataframe.groupby() function in the Pandas library. (I rarely use groupby(), for no good reason.) ChatGPT agreed–and while it’s nice to have a supercomputer agree with you, this is hardly a radical suggestion. It’s a suggestion I would have expected from a mentor who had used Python and Pandas to work with data. I had to make the suggestion myself.
ChatGPT obligingly rewrote the code–again, I probably should have told it not to. The resulting code looked reasonable, though it made a not-so-subtle change in the program’s behavior: it filtered out the “No answer” rows after computing percentages, rather than before. It’s important to watch out for minor changes like this when asking ChatGPT to help with programming. Such minor changes happen frequently, they look innocuous, but they can change the output. (A rigorous test suite would have helped.) This was an important lesson: you really can’t assume that anything ChatGPT does is correct. Even if it’s syntactically correct, even if it runs without error messages, ChatGPT can introduce changes that lead to errors. Testing has always been important (and under-utilized); with ChatGPT, it’s even more so.
Now for the next test. I accidentally omitted the final lines of my program, which made a number of graphs using Python’s matplotlib library. While this omission didn’t affect the data analysis (it printed the results on the terminal), several lines of code arranged the data in a way that was convenient for the graphing functions. These lines of code were now a kind of “dead code”: code that is executed, but that has no effect on the result. Again, I would have expected a human mentor to be all over this. I would have expected them to say “Look at the data structure graph_data. Where is that data used? If it isn’t used, why is it there?” I didn’t get that kind of help. A mentor who doesn’t point out problems in the code isn’t much of a mentor.
So my next prompt asked for suggestions about cleaning up the dead code. ChatGPT praised me for my insight and agreed that removing dead code was a good idea. But again, I don’t want a mentor to praise me for having good ideas; I want a mentor to notice what I should have noticed, but didn’t. I want a mentor to teach me to watch out for common programming errors, and that source code inevitably degrades over time if you’re not careful–even as it’s improved and restructured.
ChatGPT also rewrote my program yet again. This final rewrite was incorrect–this version didn’t work. (It might have done better if I had been using Code Interpreter, though Code Interpreter is no guarantee of correctness.) That both is, and is not, an issue. It’s yet another reminder that, if correctness is a criterion, you have to check and test everything ChatGPT generates carefully. But–in the context of mentoring–I should have written a prompt that suppressed code generation; rewriting your program isn’t the mentor’s job. Furthermore, I don’t think it’s a terrible problem if a mentor occasionally gives you poor advice. We’re all human (at least, most of us). That’s part of the learning experience. And it’s important for us to find applications for AI where errors are tolerable.
So, what’s the score?
Mentoring is an important application for language models, not the least because it finesses one of their biggest problems, their tendency to make mistakes and create errors. A mentor that occasionally makes a bad suggestion isn’t really a problem; following the suggestion and discovering that it’s a dead end is an important learning experience in itself. You shouldn’t believe everything you hear, even if it comes from a reliable source. And a mentor really has no business generating code, incorrect or otherwise.
I’m more concerned about ChatGPT’s difficulty in providing advice that’s truly insightful, the kind of advice that you really want from a mentor. It is able to provide advice when you ask it about specific problems–but that’s not enough. A mentor needs to help a student explore problems; a student who is already aware of the problem is well on their way towards solving it, and may not need the mentor at all.
ChatGPT and other language models will inevitably improve, and their ability to act as a mentor will be important to people who are building new kinds of learning experiences. But they haven’t arrived yet. For the time being, if you want a mentor, you’re on your own.
AI continues to spread. This month, the AI category is limited to developments about AI itself; tools for AI programming are covered in the Programming section.
One of the biggest issues for AI these days is legal. Getty Images is protecting customers who use their generative AI from copyright lawsuits; Microsoft is doing the same for users of their Copilot products.
Also on the legal front: Hashicorp’s switch to a non-open source license has led the OpenTF foundation to build OpenTofu, a fork of Hashicorp’s Terraform product. While it’s too early to say, OpenTofu has quickly gotten some significant adopters.
AII am wired to constantly ask “what’s next?” Sometimes, the answer is: “more of the same.”
That came to mind when a friend raised a point about emerging technology’s fractal nature. Across one story arc, they said, we often see several structural evolutions—smaller-scale versions of that wider phenomenon.
Cloud computing? It progressed from “raw compute and storage” to “reimplementing key services in push-button fashion” to “becoming the backbone of AI work”—all under the umbrella of “renting time and storage on someone else’s computers.” Web3 has similarly progressed through “basic blockchain and cryptocurrency tokens” to “decentralized finance” to “NFTs as loyalty cards.” Each step has been a twist on “what if we could write code to interact with a tamper-resistant ledger in real-time?”
Most recently, I’ve been thinking about this in terms of the space we currently call “AI.” I’ve called out the data field’s rebranding efforts before; but even then, I acknowledged that these weren’t just new coats of paint. Each time, the underlying implementation changed a bit while still staying true to the larger phenomenon of “Analyzing Data for Fun and Profit.”
Consider the structural evolutions of that theme:
Stage 1: Hadoop and Big DataBy 2008, many companies found themselves at the intersection of “a steep increase in online activity” and “a sharp decline in costs for storage and computing.” They weren’t quite sure what this “data” substance was, but they’d convinced themselves that they had tons of it that they could monetize. All they needed was a tool that could handle the massive workload. And Hadoop rolled in.
In short order, it was tough to get a data job if you didn’t have some Hadoop behind your name. And harder to sell a data-related product unless it spoke to Hadoop. The elephant was unstoppable.
Until it wasn’t.
Hadoop’s value—being able to crunch large datasets—often paled in comparison to its costs. A basic, production-ready cluster priced out to the low-six-figures. A company then needed to train up their ops team to manage the cluster, and their analysts to express their ideas in MapReduce. Plus there was all of the infrastructure to push data into the cluster in the first place.
If you weren’t in the terabytes-a-day club, you really had to take a step back and ask what this was all for. Doubly so as hardware improved, eating away at the lower end of Hadoop-worthy work.
And then there was the other problem: for all the fanfare, Hadoop was really large-scale business intelligence (BI).
(Enough time has passed; I think we can now be honest with ourselves. We built an entire industry by … repackaging an existing industry. This is the power of marketing.)
Don’t get me wrong. BI is useful. I’ve sung its praises time and again. But the grouping and summarizing just wasn’t exciting enough for the data addicts. They’d grown tired of learning what is; now they wanted to know what’s next.
Stage 2: Machine learning modelsHadoop could kind of do ML, thanks to third-party tools. But in its early form of a Hadoop-based ML library, Mahout still required data scientists to write in Java. And it (wisely) stuck to implementations of industry-standard algorithms. If you wanted ML beyond what Mahout provided, you had to frame your problem in MapReduce terms. Mental contortions led to code contortions led to frustration. And, often, to giving up.
(After coauthoring Parallel R I gave a number of talks on using Hadoop. A common audience question was “can Hadoop run [my arbitrary analysis job or home-grown algorithm]?” And my answer was a qualified yes: “Hadoop could theoretically scale your job. But only if you or someone else will take the time to implement that approach in MapReduce.” That didn’t go over well.)
Goodbye, Hadoop. Hello, R and scikit-learn. A typical data job interview now skipped MapReduce in favor of white-boarding k-means clustering or random forests.
And it was good. For a few years, even. But then we hit another hurdle.
While data scientists were no longer handling Hadoop-sized workloads, they were trying to build predictive models on a different kind of “large” dataset: so-called “unstructured data.” (I prefer to call that “soft numbers,” but that’s another story.) A single document may represent thousands of features. An image? Millions.
Similar to the dawn of Hadoop, we were back to problems that existing tools could not solve.
The solution led us to the next structural evolution. And that brings our story to the present day:
Stage 3: Neural networksHigh-end video games required high-end video cards. And since the cards couldn’t tell the difference between “matrix algebra for on-screen display” and “matrix algebra for machine learning,” neural networks became computationally feasible and commercially viable. It felt like, almost overnight, all of machine learning took on some kind of neural backend. Those algorithms packaged with scikit-learn? They were unceremoniously relabeled “classical machine learning.”
There’s as much Keras, TensorFlow, and Torch today as there was Hadoop back in 2010-2012. The data scientist—sorry, “machine learning engineer” or “AI specialist”—job interview now involves one of those toolkits, or one of the higher-level abstractions such as HuggingFace Transformers.
And just as we started to complain that the crypto miners were snapping up all of the affordable GPU cards, cloud providers stepped up to offer access on-demand. Between Google (Vertex AI and Colab) and Amazon (SageMaker), you can now get all of the GPU power your credit card can handle. Google goes a step further in offering compute instances with its specialized TPU hardware.
Not that you’ll even need GPU access all that often. A number of groups, from small research teams to tech behemoths, have used their own GPUs to train on large, interesting datasets and they give those models away for free on sites like TensorFlow Hub and Hugging Face Hub. You can download these models to use out of the box, or employ minimal compute resources to fine-tune them for your particular task.
You see the extreme version of this pretrained model phenomenon in the large language models (LLMs) that drive tools like Midjourney or ChatGPT. The overall idea of generative AI is to get a model to create content that could have reasonably fit into its training data. For a sufficiently large training dataset—say, “billions of online images” or “the entirety of Wikipedia”—a model can pick up on the kinds of patterns that make its outputs seem eerily lifelike.
Since we’re covered as far as compute power, tools, and even prebuilt models, what are the frictions of GPU-enabled ML? What will drive us to the next structural iteration of Analyzing Data for Fun and Profit?
Stage 4? SimulationGiven the progression thus far, I think the next structural evolution of Analyzing Data for Fun and Profit will involve a new appreciation for randomness. Specifically, through simulation.
You can see a simulation as a temporary, synthetic environment in which to test an idea. We do this all the time, when we ask “what if?” and play it out in our minds. “What if we leave an hour earlier?” (We’ll miss rush hour traffic.) “What if I bring my duffel bag instead of the roll-aboard?” (It will be easier to fit in the overhead storage.) That works just fine when there are only a few possible outcomes, across a small set of parameters.
Once we’re able to quantify a situation, we can let a computer run “what if?” scenarios at industrial scale. Millions of tests, across as many parameters as will fit on the hardware. It’ll even summarize the results if we ask nicely. That opens the door to a number of possibilities, three of which I’ll highlight here:
Moving beyond from point estimatesLet’s say an ML model tells us that this house should sell for $744,568.92. Great! We’ve gotten a machine to make a prediction for us. What more could we possibly want?
Context, for one. The model’s output is just a single number, a point estimate of the most likely price. What we really want is the spread—the range of likely values for that price. Does the model think the correct price falls between $743k-$746k? Or is it more like $600k-$900k? You want the former case if you’re trying to buy or sell that property.
Bayesian data analysis, and other techniques that rely on simulation behind the scenes, offer additional insight here. These approaches vary some parameters, run the process a few million times, and give us a nice curve that shows how often the answer is (or, “is not”) close to that $744k.
Similarly, Monte Carlo simulations can help us spot trends and outliers in potential outcomes of a process. “Here’s our risk model. Let’s assume these ten parameters can vary, then try the model with several million variations on those parameter sets. What can we learn about the potential outcomes?” Such a simulation could reveal that, under certain specific circumstances, we get a case of total ruin. Isn’t it nice to uncover that in a simulated environment, where we can map out our risk mitigation strategies with calm, level heads?
Moving beyond point estimates is very close to present-day AI challenges. That’s why it’s a likely next step in Analyzing Data for Fun and Profit. In turn, that could open the door to other techniques:
New ways of exploring the solution spaceIf you’re not familiar with evolutionary algorithms, they’re a twist on the traditional Monte Carlo approach. In fact, they’re like several small Monte Carlo simulations run in sequence. After each iteration, the process compares the results to its fitness function, then mixes the attributes of the top performers. Hence the term “evolutionary”—combining the winners is akin to parents passing a mix of their attributes on to progeny. Repeat this enough times and you may just find the best set of parameters for your problem.
(People familiar with optimization algorithms will recognize this as a twist on simulated annealing: start with random parameters and attributes, and narrow that scope over time.)
A number of scholars have tested this shuffle-and-recombine-till-we-find-a-winner approach on timetable scheduling. Their research has applied evolutionary algorithms to groups that need efficient ways to manage finite, time-based resources such as classrooms and factory equipment. Other groups have tested evolutionary algorithms in drug discovery. Both situations benefit from a technique that optimizes the search through a large and daunting solution space.
The NASA ST5 antenna is another example. Its bent, twisted wire stands in stark contrast to the straight aerials with which we are familiar. There’s no chance that a human would ever have come up with it. But the evolutionary approach could, in part because it was not limited by human sense of aesthetic or any preconceived notions of what an “antenna” could be. It just kept shuffling the designs that satisfied its fitness function until the process finally converged.
Taming complexityComplex adaptive systems are hardly a new concept, though most people got a harsh introduction at the start of the Covid-19 pandemic. Cities closed down, supply chains snarled, and people—independent actors, behaving in their own best interests—made it worse by hoarding supplies because they thought distribution and manufacturing would never recover. Today, reports of idle cargo ships and overloaded seaside ports remind us that we shifted from under- to over-supply. The mess is far from over.
What makes a complex system troublesome isn’t the sheer number of connections. It’s not even that many of those connections are invisible because a person can’t see the entire system at once. The problem is that those hidden connections only become visible during a malfunction: a failure in Component B affects not only neighboring Components A and C, but also triggers disruptions in T and R. R’s issue is small on its own, but it has just led to an outsized impact in Φ and Σ.
(And if you just asked “wait, how did Greek letters get mixed up in this?” then … you get the point.)
Our current crop of AI tools is powerful, yet ill-equipped to provide insight into complex systems. We can’t surface these hidden connections using a collection of independently-derived point estimates; we need something that can simulate the entangled system of independent actors moving all at once.
This is where agent-based modeling (ABM) comes into play. This technique simulates interactions in a complex system. Similar to the way a Monte Carlo simulation can surface outliers, an ABM can catch unexpected or unfavorable interactions in a safe, synthetic environment.
Financial markets and other economic situations are prime candidates for ABM. These are spaces where a large number of actors behave according to their rational self-interest, and their actions feed into the system and affect others’ behavior. According to practitioners of complexity economics (a study that owes its origins to the Sante Fe Institute), traditional economic modeling treats these systems as though they run in an equilibrium state and therefore fails to identify certain kinds of disruptions. ABM captures a more realistic picture because it simulates a system that feeds back into itself.
Smoothing the on-rampInterestingly enough, I haven’t mentioned anything new or ground-breaking. Bayesian data analysis and Monte Carlo simulations are common in finance and insurance. I was first introduced to evolutionary algorithms and agent-based modeling more than fifteen years ago. (If memory serves, this was shortly before I shifted my career to what we now call AI.) And even then I was late to the party.
So why hasn’t this next phase of Analyzing Data for Fun and Profit taken off?
For one, this structural evolution needs a name. Something to distinguish it from “AI.” Something to market. I’ve been using the term “synthetics,” so I’ll offer that up. (Bonus: this umbrella term neatly includes generative AI’s ability to create text, images, and other realistic-yet-heretofore-unseen data points. So we can ride that wave of publicity.)
Next up is compute power. Simulations are CPU-heavy, and sometimes memory-bound. Cloud computing providers make that easier to handle, though, so long as you don’t mind the credit card bill. Eventually we’ll get simulation-specific hardware—what will be the GPU or TPU of simulation?—but I think synthetics can gain traction on existing gear.
The third and largest hurdle is the lack of simulation-specific frameworks. As we surface more use cases—as we apply these techniques to real business problems or even academic challenges—we’ll improve the tools because we’ll want to make that work easier. As the tools improve, that reduces the costs of trying the techniques on other use cases. This kicks off another iteration of the value loop. Use cases tend to magically appear as techniques get easier to use.
If you think I’m overstating the power of tools to spread an idea, imagine trying to solve a problem with a new toolset while also creating that toolset at the same time. It’s tough to balance those competing concerns. If someone else offers to build the tool while you use it and road-test it, you’re probably going to accept. This is why these days we use TensorFlow or Torch instead of hand-writing our backpropagation loops.
Today’s landscape of simulation tooling is uneven. People doing Bayesian data analysis have their choice of two robust, authoritative offerings in Stan and PyMC3, plus a variety of books to understand the mechanics of the process. Things fall off after that. Most of the Monte Carlo simulations I’ve seen are of the hand-rolled variety. And a quick survey of agent-based modeling and evolutionary algorithms turns up a mix of proprietary apps and nascent open-source projects, some of which are geared for a particular problem domain.
As we develop the authoritative toolkits for simulations—the TensorFlow of agent-based modeling and the Hadoop of evolutionary algorithms, if you will—expect adoption to grow. Doubly so, as commercial entities build services around those toolkits and rev up their own marketing (and publishing, and certification) machines.
Time will tellMy expectations of what to come are, admittedly, shaped by my experience and clouded by my interests. Time will tell whether any of this hits the mark.
A change in business or consumer appetite could also send the field down a different road. The next hot device, app, or service will get an outsized vote in what companies and consumers expect of technology.
Still, I see value in looking for this field’s structural evolutions. The wider story arc changes with each iteration to address changes in appetite. Practitioners and entrepreneurs, take note.
Job-seekers should do the same. Remember that you once needed Hadoop on your résumé to merit a second look; nowadays it’s a liability. Building models is a desired skill for now, but it’s slowly giving way to robots. So do you really think it’s too late to join the data field? I think not.
Keep an eye out for that next wave. That’ll be your time to jump in.
A few weeks ago, I saw a tweet that said “Writing code isn’t the problem. Controlling complexity is.” I wish I could remember who said that; I will be quoting it a lot in the future. That statement nicely summarizes what makes software development difficult. It’s not just memorizing the syntactic details of some programming language, or the many functions in some API, but understanding and managing the complexity of the problem you’re trying to solve.
We’ve all seen this many times. Lots of applications and tools start simple. They do 80% of the job well, maybe 90%. But that isn’t quite enough. Version 1.1 gets a few more features, more creep into version 1.2, and by the time you get to 3.0, an elegant user interface has turned into a mess. This increase in complexity is one reason that applications tend to become less useable over time. We also see this phenomenon as one application replaces another. RCS was useful, but didn’t do everything we needed it to; SVN was better; Git does just about everything you could want, but at an enormous cost in complexity. (Could Git’s complexity be managed better? I’m not the one to say.) OS X, which used to trumpet “It just works,” has evolved to “it used to just work”; the most user-centric Unix-like system ever built now staggers under the load of new and poorly thought-out features.
The problem of complexity isn’t limited to user interfaces; that may be the least important (though most visible) aspect of the problem. Anyone who works in programming has seen the source code for some project evolve from something short, sweet, and clean to a seething mass of bits. (These days, it’s often a seething mass of distributed bits.) Some of that evolution is driven by an increasingly complex world that requires attention to secure programming, cloud deployment, and other issues that didn’t exist a few decades ago. But even here: a requirement like security tends to make code more complex—but complexity itself hides security issues. Saying “yes, adding security made the code more complex” is wrong on several fronts. Security that’s added as an afterthought almost always fails. Designing security in from the start almost always leads to a simpler result than bolting security on as an afterthought, and the complexity will stay manageable if new features and security grow together. If we’re serious about complexity, the complexity of building secure systems needs to be managed and controlled in step with the rest of the software, otherwise it’s going to add more vulnerabilities.
That brings me to my main point. We’re seeing more code that’s written (at least in first draft) by generative AI tools, such as GitHub Copilot, ChatGPT (especially with Code Interpreter), and Google Codey. One advantage of computers, of course, is that they don’t care about complexity. But that advantage is also a significant disadvantage. Until AI systems can generate code as reliably as our current generation of compilers, humans will need to understand—and debug—the code they write. Brian Kernighan wrote that “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” We don’t want a future that consists of code too clever to be debugged by humans—at least not until the AIs are ready to do that debugging for us. Really brilliant programmers write code that finds a way out of the complexity: code that may be a little longer, a little clearer, a little less clever so that someone can understand it later. (Copilot running in VSCode has a button that simplifies code, but its capabilities are limited.)
Furthermore, when we’re considering complexity, we’re not just talking about individual lines of code and individual functions or methods. Most professional programmers work on large systems that can consist of thousands of functions and millions of lines of code. That code may take the form of dozens of microservices running as asynchronous processes and communicating over a network. What is the overall structure, the overall architecture, of these programs? How are they kept simple and manageable? How do you think about complexity when writing or maintaining software that may outlive its developers? Millions of lines of legacy code going back as far as the 1960s and 1970s are still in use, much of it written in languages that are no longer popular. How do we control complexity when working with these?
Humans don’t manage this kind of complexity well, but that doesn’t mean we can check out and forget about it. Over the years, we’ve gradually gotten better at managing complexity. Software architecture is a distinct specialty that has only become more important over time. It’s growing more important as systems grow larger and more complex, as we rely on them to automate more tasks, and as those systems need to scale to dimensions that were almost unimaginable a few decades ago. Reducing the complexity of modern software systems is a problem that humans can solve—and I haven’t yet seen evidence that generative AI can. Strictly speaking, that’s not a question that can even be asked yet. Claude 2 has a maximum context—the upper limit on the amount of text it can consider at one time—of 100,000 tokens1; at this time, all other large language models are significantly smaller. While 100,000 tokens is huge, it’s much smaller than the source code for even a moderately sized piece of enterprise software. And while you don’t have to understand every line of code to do a high-level design for a software system, you do have to manage a lot of information: specifications, user stories, protocols, constraints, legacies and much more. Is a language model up to that?
Could we even describe the goal of “managing complexity” in a prompt? A few years ago, many developers thought that minimizing “lines of code” was the key to simplification—and it would be easy to tell ChatGPT to solve a problem in as few lines of code as possible. But that’s not really how the world works, not now, and not back in 2007. Minimizing lines of code sometimes leads to simplicity, but just as often leads to complex incantations that pack multiple ideas onto the same line, often relying on undocumented side effects. That’s not how to manage complexity. Mantras like DRY (Don’t Repeat Yourself) are often useful (as is most of the advice in The Pragmatic Programmer), but I’ve made the mistake of writing code that was overly complex to eliminate one of two very similar functions. Less repetition, but the result was more complex and harder to understand. Lines of code are easy to count, but if that’s your only metric, you will lose track of qualities like readability that may be more important. Any engineer knows that design is all about tradeoffs—in this case, trading off repetition against complexity—but difficult as these tradeoffs may be for humans, it isn’t clear to me that generative AI can make them any better, if at all.
I’m not arguing that generative AI doesn’t have a role in software development. It certainly does. Tools that can write code are certainly useful: they save us looking up the details of library functions in reference manuals, they save us from remembering the syntactic details of the less commonly used abstractions in our favorite programming languages. As long as we don’t let our own mental muscles decay, we’ll be ahead. I am arguing that we can’t get so tied up in automatic code generation that we forget about controlling complexity. Large language models don’t help with that now, though they might in the future. If they free us to spend more time understanding and solving the higher-level problems of complexity, though, that will be a significant gain.
Will the day come when a large language model will be able to write a million line enterprise program? Probably. But someone will have to write the prompt telling it what to do. And that person will be faced with the problem that has characterized programming from the start: understanding complexity, knowing where it’s unavoidable, and controlling it.
FootnotesWhile the AI group is still the largest, it’s notable that Programming, Web, and Security are all larger than they’ve been in recent months. One reason is certainly that we’re pushing AI news into other categories as appropriate. But I also think that it’s harder to impress with AI than it used to be. AI discussions have been much more about regulation and intellectual property—which makes me wonder whether legislation should be a separate category.
That notwithstanding, it’s important that OpenAI is now allowing API users to fine-tune their GPT-4 apps. It’s as-a-service, of course. And RISC-V finally appears to be getting some serious adoption. Could it compete with Atom and Intel? We shall see.
AITo follow up on our previous survey about low-code and no-code tools, we decided to run another short survey about tools specifically for software developers—including, but not limited to, GitHub Copilot and ChatGPT. We’re interested in how “developer enablement” tools of all sorts are changing the workplace. Our survey 1 showed that while these tools increased productivity, they aren’t without their costs. Both upskilling and retraining developers to use these tools are issues.
Few professional software developers will find it surprising that software development teams are respondents said that productivity is the biggest challenge their organization faced, and another 19% said that time to market and deployment speed are the biggest challenges. Those two answers are almost the same: decreasing time to market requires increasing productivity, and improving deployment speed is itself an increase in productivity. Together, those two answers represented 48% of the respondents, just short of half.
HR issues were the second-most-important challenge, but they’re nowhere near as pressing. 12% of the respondents reported that job satisfaction is the greatest challenge; 11% said that there aren’t good job candidates to hire; and 10% said that employee retention is the biggest issue. Those three challenges total 33%, just one-third of the respondents.
It’s heartening to realize that hiring and retention are still challenges in this time of massive layoffs, but it’s also important to realize that these issues are less important than productivity.
But the big issue, the issue we wanted to explore, isn’t the challenges themselves; it’s what organizations are doing to meet them. A surprisingly large percentage of respondents (28%) aren’t making any changes to become more productive. But 20% are changing their onboarding and upskilling processes, 15% are hiring new developers, and 13% are using self-service engineering platforms.
We found that the biggest struggle for developers working with new tools is training (34%), and another 12% said the biggest struggle is “ease of use.” Together, that’s almost half of all respondents (46%). That was a surprise, since many of these tools are supposed to be low- or no-code. We’re thinking specifically about tools like GitHub Copilot, Amazon CodeWhisperer, and other code generators, but almost all productivity tools claim to make life simpler. At least at first, that’s clearly not true. There’s a learning curve, and it appears to be steeper than we’d have guessed. It’s also worth noting that 13% of the respondents said that the tools “didn’t effectively solve the problems that developers face.”
Over half of the respondents (51%) said that their organizations are using self-service deployment pipelines to increase productivity. Another 13% said that while they’re using self-service pipelines, they haven’t seen an increase in productivity. So almost two-thirds of the respondents are using self-service pipelines for deployment, and for most of them, the pipelines are working—reducing the overhead required to put new projects into production.
Finally, we wanted to know specifically about the effect of GitHub Copilot, ChatGPT, and other AI-based programming tools. Two-thirds of the respondents (67%) reported that these tools aren’t in use at their organizations. We suspect this estimate is lowballing Copilot’s actual usage. Back in the early 2000s, a widely quoted survey reported that CIOs almost unanimously said that their IT organizations weren’t making use of open source. How little they knew! Actual usage of Copilot, ChatGPT, and similar tools is likely to be much higher than 33%. We’re sure that even if they aren’t using Copilot or ChatGPT on the job, many programmers are experimenting with these tools or using them on personal projects.
What about the 33% who reported that Copilot and ChatGPT are in use at their organizations? First, realize that these are early adopters: Copilot was only released a year and a half ago, and ChatGPT has been out for less than a year. It’s certainly significant that they (and similar tools) have grabbed a third of the market in that short a period. It’s also significant that making a commitment to a new way of programming—and these tools are nothing if not a new kind of programming—is a much bigger change than, say, signing up for a ChatGPT account.
11% of the respondents said their organizations use Copilot and ChatGPT, and that the tools are primarily useful to junior developers; 13% said they’re primarily useful to senior developers. Another 9% said that the tools haven’t yielded an increase in productivity. The difference between junior and senior developers is closer than we expected. Common wisdom is that Copilot is more of an advantage to senior programmers, who are better able to describe the problem they need to solve in an intricate set of prompts and to notice bugs in the generated code quickly. Our survey hints that the difference between senior and junior developers is relatively small—although they’re almost certainly using Copilot in different ways. Junior developers are using it to learn and to spend less time solving problems by looking up solutions on Stack Overflow or searching online documentation. Senior developers are using it to help design and structure systems, and even to create production code.
Is developer productivity an issue? Of course; it always is. Part of the solution is improved tooling: self-service deployment, code-generation tools, and other new technologies and ideas. Productivity tools—and specifically the successors to tools like Copilot—are remaking software development in radical ways. Software developers are getting value from these tools, but don’t let the buzz fool you: that value doesn’t come for free. Nobody’s going to sit down with ChatGPT, type “Generate an enterprise application for selling shoes,” and come away with something worthwhile. Each has its own learning curve, and it’s easy to underestimate how steep that curve can be. Developer productivity tools will be a big part of the future; but to take full advantage of those tools, organizations will need to plan for skills development.
I’m sure that nobody will be surprised that the number of searches for ChatGPT on the O’Reilly learning platform skyrocketed after its release in November, 2022. It might be a surprise how quickly it got to the top of our charts: it peaked in May as the 6th most common search query. Then it dropped almost as quickly: it dropped back to #8 in June, and fell further to #19 in July. At its peak, ChatGPT was in very exclusive company: it’s not quite on the level of Python, Kubernetes, and Java, but it’s in the mix with AWS and React, and significantly ahead of Docker.
A look at the number of searches for terms commonly associated with AI shows how dramatic this rise was:
ChatGPT came from nowhere to top all the AI search terms except for Machine Learning itself, which is consistently our #3 search term—and, despite ChatGPT’s dramatic decline in June and July, it’s still ahead of all other search terms relevant to AI. The number of searches for Machine Learning itself held steady, though it arguably declined slightly when ChatGPT appeared. What’s more interesting, though, is that the search term “Generative AI” suddenly emerged from the pack as the third most popular search term. If current trends continue, in August we might see more searches for Generative AI than for ChatGPT.
What can we make of this? Everyone knows that ChatGPT had one of the most successful launches of any software project, passing a million users in its first five days. (Since then, it’s been beaten by Facebook’s Threads, though that’s not really a fair comparison.) There are plenty of reasons for this surge. Talking computers have been a science fiction dream since well before Star Trek—by itself, that’s a good reason for the public’s fascination. ChatGPT might simplify common tasks, from doing research to writing essays to basic programming, so many people want to use it to save labor—though getting it to do quality work is more difficult than it seems at first glance. (We’ll leave the issue of whether this is “cheating” to the users, their teachers, and their employers.) And, while I’ve written frequently about how ChatGPT will change programming, it will undoubtedly have an even greater effect on non-programmers. It will give them the chance to tell computers what to do without programming; it’s the ultimate “low code” experience.
So there are plenty of reasons for ChatGPT to surge. What about other search terms? It’s easy to dismiss these search queries as also-rans, but they were all in the top 300 for May, 2023—and we typically have a few million unique search terms per month. Removing ChatGPT and Machine Learning from the previous graph makes it easier to see trends in the other popular search terms:
It’s mostly “up and to the right.” Three search terms stand out: Generative AI, LLM, and Langchain all follow similar curves: they start off with relatively moderate growth that suddenly becomes much steeper in February, 2023. We’ve already noted that the number of searches for Generative AI increased sharply since the release of ChatGPT, and haven’t declined in the past two months. Our users evidently prefer LLM to spelling out “Large Language Models,” but if you add these two search terms together, the total number of searches for July is within 1% of Generative AI. This surge didn’t really start until last November, when it was spurred by the appearance of ChatGPT—even though search terms like LLM were already in circulation because of GPT-3, DALL-E, StableDiffusion, Midjourney, and other language-based generative AI tools.
Unlike LLMs, Langchain didn’t exist prior to ChatGPT—but once it appeared, the number of searches took off rapidly, and didn’t decline in June and July. That makes sense; although it’s still early, Langchain looks like it will be the cornerstone of LLM-based software development. It’s a widely used platform for building applications that generate queries programmatically and that connects LLMs with each other, with databases, and with other software. Langchain is frequently used to look up relevant articles that weren’t in ChatGPT’s training data and package them as part of a lengthy prompt.
In this group, the only search term that seems to be in a decline is Natural Language Processing. Although large language models clearly fall into the category of NLP, we suspect that most users associate NLP with older approaches to building chatbots. Searches for Artificial Intelligence appear to be holding their own, though it’s surprising that there are so few searches for AI compared to Machine Learning. The difference stems from O’Reilly’s audience, which is relatively technical and prefers the more precise term Machine Learning. Nevertheless, the number of searches for AI rose with the release of ChatGPT, possibly because ChatGPT’s appeal wasn’t limited to the technical community.
Now that we’ve run through the data, we’re left with the big question: What happened to ChatGPT? Why did it decline from roughly 5,000 searches to slightly over 2,500 in a period of two months? There are many possible reasons. Perhaps students stopped using ChatGPT for homework assignments as graduation and summer vacation approached. Perhaps ChatGPT has saturated the world; people know what they need to know, and are waiting for the next blockbuster. An article in Ars Technica notes that ChatGPT usage declined from May to June, and suggests many possible causes, including attention to the Twitter/Threads drama and frustration because OpenAI implemented stricter guardrails to prevent abuse. It would be unfortunate if ChatGPT usage is declining because people can’t use it to generate abusive content, but that’s a different article…
A more important reason for this decline might be that ChatGPT is no longer the only game in town. There are now many alternative language models. Most of these alternatives descend from Meta’s LLaMA and Georgi Gerganov’s llama.cpp (which can run on laptops, cell phones, and even Raspberry Pi). Users can train these models to do whatever they want. Some of these models already have chat interfaces, and all of them could support chat interfaces with some fairly simple programming. None of these alternatives generate significant search traffic at O’Reilly, but that doesn’t mean that they won’t in the future, or that they aren’t an important part of the ecosystem. Their proliferation is an important piece of evidence about what’s happening among O’Reilly’s users. AI developers now need to ask a question that didn’t even exist last November: should they build on large foundation models like ChatGPT or Google’s Bard, using public APIs and paying by the token? Or should they start with an open source model that can run locally and be trained for their specific application?
This last explanation makes a lot of sense in context. We’ve moved beyond the initial phase, when ChatGPT was a fascinating toy. We’re now building applications and incorporating language models into products, so trends in search terms have shifted accordingly. A developer interested in building with large language models needs more context; learning about ChatGPT by itself isn’t enough. Developers who want to learn about language models need different kinds of information, information that’s both deeper and broader. They need to learn about how generative AI works, about new LLMs, about programming with Langchain and other platforms. All of these search terms increased while ChatGPT declined. Now that there are options, and now that everyone has had a chance to try out ChatGPT, the first step in an AI project isn’t to search for ChatGPT. It’s to get a sense of the landscape, to discover the possibilities.
Searches for ChatGPT peaked quickly, and are now declining rapidly—and who knows what August and September will bring? (We wouldn’t be surprised to see ChatGPT bounce back as students return to school and homework assignments.) The real news is that ChatGPT is no longer the whole story: you can’t look at the decline in ChatGPT without also considering what else our users are searching for as they start building AI into other projects. Large language models are very clearly part of the future. They will change the way we work and live, and we’re just at the start of the revolution.
Artificial Intelligence continues to dominate the news. In the past month, we’ve seen a number of major updates to language models: Claude 2, with its 100,000 token context limit; LLaMA 2, with (relatively) liberal restrictions on use; and Stable Diffusion XL, a significantly more capable version of Stable Diffusion. Does Claude 2’s huge context really change what the model can do? And what role will open access and open source language models have as commercial applications develop?
Artificial IntelligenceIf you’re reading this, chances are you’ve played around with using AI tools like ChatGPT or GitHub Copilot to write code for you. Or even if you haven’t yet, then you’ve at least heard about these tools in your newsfeed over the past year. So far I’ve read a gazillion blog posts about people’s experiences with these AI coding assistance tools. These posts often recount someone trying ChatGPT or Copilot for the first time with a few simple prompts, seeing how it does for some small self-contained coding tasks, and then making sweeping claims like “WOW this exceeded all my highest hopes and wildest dreams, it’s going to replace all programmers in five years!” or “ha look how incompetent it is … it couldn’t even get my simple question right!”
I really wanted to go beyond these quick gut reactions that I’ve seen so much of online, so I tried using ChatGPT for a few weeks to help me implement a hobby software project and took notes on what I found interesting. This article summarizes what I learned from that experience. The inspiration (and title) for it comes from Mike Loukides’ Radar article on Real World Programming with ChatGPT, which shares a similar spirit of digging into the potential and limits of AI tools for more realistic end-to-end programming tasks.
Setting the Stage: Who Am I and What Am I Trying to Build?I’m a professor who is interested in how we can use LLMs (Large Language Models) to teach programming. My student and I recently published a research paper on this topic, which we summarized in our Radar article Teaching Programming in the Age of ChatGPT. Our paper reinforces the growing consensus that LLM-based AI tools such as ChatGPT and GitHub Copilot can now solve many of the small self-contained programming problems that are found in introductory classes. For instance, problems like “write a Python function that takes a list of names, splits them by first and last name, and sorts by last name.” It’s well-known that current AI tools can solve these kinds of problems even better than many students can. But there’s a huge difference between AI writing self-contained functions like these and building a real piece of software end-to-end. I was curious to see how well AI could help students do the latter, so I wanted to first try doing it myself.
I needed a concrete project to implement with the help of AI, so I decided to go with an idea that had been in the back of my head for a while now: Since I read a lot of research papers for my job, I often have multiple browser tabs open with the PDFs of papers I’m planning to read. I thought it would be cool to play music from the year that each paper was written while I was reading it, which provides era-appropriate background music to accompany each paper. For instance, if I’m reading a paper from 2019, a popular song from that year could start playing. And if I switch tabs to view a paper from 2008, then a song from 2008 could start up. To provide some coherence to the music, I decided to use Taylor Swift songs since her discography covers the time span of most papers that I typically read: Her main albums were released in 2006, 2008, 2010, 2012, 2014, 2017, 2019, 2020, and 2022. This choice also inspired me to call my project Swift Papers.
Swift Papers felt like a well-scoped project to test how well AI handles a realistic yet manageable real-world programming task. Here’s how I worked on it: I subscribed to ChatGPT Plus and used the GPT-4 model in ChatGPT (first the May 12, 2023 version, then the May 24 version) to help me with design and implementation. I also installed the latest VS Code (Visual Studio Code) with GitHub Copilot and the experimental Copilot Chat plugins, but I ended up not using them much. I found it easier to keep a single conversational flow within ChatGPT rather than switching between multiple tools. Lastly, I tried not to search for help on Google, Stack Overflow, or other websites, which is what I would normally be doing while programming. In sum, this is me trying to simulate the experience of relying as much as possible on ChatGPT to get this project done.
Getting Started: Setup Trials and TribulationsHere’s the exact prompt I used to start my conversation with ChatGPT using GPT-4:
Act as a software developer to help me build something that will play music from a time period that matches when an academic paper I am reading in the browser was written.
I purposely kept this prompt high-level and underspecified since I wanted ChatGPT to guide me toward design and implementation ideas without me coming in with preconceived notions.
ChatGPT immediately suggested a promising direction—making a browser extension that gets the date of the research paper PDF in the currently-active tab and calls a music streaming API to play a song from that time period. Since I already had a YouTube Music account, I asked whether I could use it, but ChatGPT said that YouTube Music doesn’t have an API. We then brainstormed alternative ideas like using a browser automation tool to programmatically navigate and click on parts of the YouTube Music webpage. ChatGPT gave me some ideas along these lines but warned me that, “It’s important to note that while this approach doesn’t use any official APIs, it’s more brittle and more subject to break if YouTube Music changes their website structure. […] keep in mind that web scraping and browser automation can be complex, and handling all of the edge cases can be a significant amount of work. […] using APIs might be a more reliable and manageable solution.” That warning convinced me to drop this idea. I recalled that ChatGPT had recommended the Spotify Web API in an earlier response, so I asked it to teach me more about what it can do and tell me why I should use it rather than YouTube Music. It seemed like Spotify had what I needed, so I decided to go with it. I liked how ChatGPT helped me work through the tradeoffs of these initial design decisions before diving head-first into coding.
Next we worked together to set up the boilerplate code for a Chrome browser extension, which I’ve never made before. ChatGPT started by generating a manifest.json file for me, which holds the configuration settings that every Chrome extension needs. I didn’t know it at the time, but manifest.json would cause me a bunch of frustration later on. Specifically:
Wrestling with all these finicky details of manifest.json before I could begin any real coding felt like death by a thousand cuts. In addition, ChatGPT generated other starter code in the chat, which I copied into new files in my VS Code project:
Intermission 1: ChatGPT as a Personalized TutorAs shown above, a typical Chrome extension like mine has at least three JavaScript files: a background script, a content script, and a pop-up script. At this point I wanted to learn more about what all these files are meant to do rather than continuing to obediently copy-paste code from ChatGPT into my project. Specifically, I discovered that each file has different permissions for what browser or page components it can access, so all three must coordinate to make the extension work as intended. Normally I would read tutorials about how this all fits together, but the problem with tutorials is that they are not customized to my specific use case. Tutorials provide generic conceptual explanations and use made-up toy examples that I can’t relate to. So I end up needing to figure out how their explanations may or may not apply to my own context.
In contrast, ChatGPT can generate personalized tutorials that use my own Swift Papers project as the example in its explanations! For instance, when it explained to me what a content script does, it added that “For your specific project, a content script would be used to extract information (the publication date) from the academic paper’s webpage. The content script can access the DOM of the webpage, find the element that contains the publication date, and retrieve the date.” Similarly, it taught me that “Background scripts are ideal for handling long-term or ongoing tasks, managing state, maintaining databases, and communicating with remote servers. In your project, the background script could be responsible for communicating with the music API, controlling the music playback, and storing any data or settings that need to persist between browsing sessions.”
I kept asking ChatGPT follow-up questions to get it to teach me more nuances about how Chrome extensions worked, and it grounded its explanations in how those concepts applied to my Swift Papers project. To accompany its explanations, it also generated relevant example code that I could try out by running my extension. These explanations clicked well in my head because I was already deep into working on Swift Papers. It was a much better learning experience than, say, reading generic getting-started tutorials that walk through creating example extensions like “track your page reading time” or “remove clutter from a webpage” or “manage your tabs better” … I couldn’t bring myself to care about those examples since THEY WEREN’T RELEVANT TO ME! At the time, I cared only about how these concepts applied to my own project, so ChatGPT shined here by generating personalized mini-tutorials on-demand.
Another great side-effect of ChatGPT teaching me these concepts directly within our ongoing chat conversation is that whenever I went back to work on Swift Papers after a few days away from it, I could scroll back up in the chat history to review what I recently learned. This reinforced the knowledge in my head and got me back into the context of resuming where I last left off. To me, this is a huge benefit of a conversational interface like ChatGPT versus an IDE autocomplete interface like GitHub Copilot, which doesn’t leave a trace of its interaction history. Even though I had Copilot installed in VS Code as I was working on Swift Papers, I rarely used it (beyond simple autocompletions) since I liked having a chat history in ChatGPT to refer back to in later sessions.
Next Up: Choosing and Installing a Date Parsing LibraryIdeally Swift Papers would infer the date when an academic paper was written by analyzing its PDF file, but that seemed too hard to do since there isn’t a standard place within a PDF where the publication date is listed. Instead what I decided to do was to parse the “landing pages” for each paper that contains metadata such as its title, abstract, and publication date. Many papers I read are linked from a small handful of websites, such as the ACM Digital Library, arXiv, or Google Scholar, so I could parse the HTML of those landing pages to extract publication dates. For instance, here’s the landing page for the classic Beyond being there paper:
I wanted to parse the “Published: 01 June 1992” string on that page to get 1992 as the publication year. I could’ve written this code by hand, but I wanted to try using a JavaScript date parsing library since it would be more robust to date format variations that appear on various websites (e.g., using “22” for the year 2022). Also, since any real-world software project will need to use external libraries, I wanted to see how well ChatGPT could help me choose and install libraries.
ChatGPT suggested two libraries for me: Moment.js and chrono-node. However, it warned me about Moment.js: “as of September 2020, it is considered a legacy project and not recommended for new projects as the team is not planning on doing any new development or maintenance.” I verified this was true by seeing the same warning on the Moment.js homepage. But still, I liked how Moment.js was available as a single self-contained file that I could directly include into my extension without using a package manager like npm or a bundler like webpack (the fewer external tools I needed to set up, the better!). Or so I thought … ChatGPT led me to believe that I could get by without npm and webpack, but later I discovered that this only works in the old Manifest v2 extension format that is no longer supported by Chrome. It turns out that with Manifest v3, all external libraries must be bundled together using tools like npm and webpack in order for the extension to import them. So it goes… once more I got bitten by the fact that ChatGPT was biased toward producing outdated information for v2 rather than the newer v3, presumably because there was a lot more information about v2 in its training data.
Since I had to use npm and webpack anyways, I decided to go with chrono-node since it seemed more robust and up-to-date (no pun intended). I had ChatGPT help me set up my webpack configuration file (webpack.config.js) and things almost seemed to work, except that I got a mysterious error. When I pasted the error message into ChatGPT, it correctly diagnosed the problem, which was something related to (surprise surprise!) webpack’s default settings not being compatible with the security restrictions of my extension’s Manifest v3 format. It also suggested a working fix to webpack.config.js:
This error message is related to the Content Security Policy (CSP) of Chrome extensions. […] By default, Manifest V3 disallows the use of eval() and the Function() constructor, which are typically used for evaluating or creating code from strings […] Webpack, by default, includes a small amount of inline JavaScript code for handling hot module replacement and some other features. This code uses eval() which is not allowed in the context of a Chrome extension. To solve this, you can adjust your webpack configuration to output code in a format that doesn’t rely on eval(). In your webpack.config.js, you can set the devtool option to ‘none’ or use the ‘source-map’ setting […]
Here again ChatGPT showed me that it clearly knew what the problem was (since it told me after I fed it the error message!) and how to fix it. So why didn’t it produce the correct webpack configuration file in the first place?
More generally, several times I’ve seen ChatGPT produce code that I felt might be incorrect. Then when I tell it that there might be a bug in a certain part, it admits its mistake and produces the correct code in response. If it knew that its original code was incorrect, then why didn’t it generate the correct code in the first place?!? Why did I have to ask it to clarify before it admitted its mistake? I’m not an expert at how LLMs work internally, but my layperson guess is that it may have to do with the fact that ChatGPT generates code linearly one token at a time, so it may get ‘stuck’ near local maxima (with code that mostly works but is incorrect in some way) while it is navigating the enormous abstract space of possible output code tokens; and it can’t easily backtrack to correct itself as it generates code in a one-way linear stream. But after it finishes generating code, when the user asks it to review that code for possible errors, it can now “see” and analyze all of that code at once. This comprehensive view of the code may enable ChatGPT to find bugs better, even if it couldn’t avoid introducing those bugs in the first place due to how it incrementally generates code in a one-way stream. (This isn’t an accurate technical explanation, but it’s how I informally think about it.)
Intermission 2: ChatGPT as a UX Design ConsultantNow that I had a basic Chrome extension that could extract paper publication dates from webpages, the next challenge was using the Spotify API to play era-appropriate Taylor Swift songs to accompany these papers. But before embarking on another coding-intensive adventure, I wanted to switch gears and think more about UX (user experience). I got so caught up in the first few hours of getting my extension set up that I hadn’t thought about how this app ought to work in detail. What I needed at this time was a UX design consultant, so I wanted to see if ChatGPT could play this role.
Note that up until now I had been doing everything in one long-running chat session that focused on coding-related questions. That was great because ChatGPT was fully “in the zone” and had a very long conversation (spanning several hours over multiple days) to use as context for generating code suggestions and technical explanations. But I didn’t want all that prior context to influence our UX discussion, so I decided to begin again by starting a brand-new session with the following prompt:
You are a Ph.D. graduate in Human-Computer Interaction and now a senior UX (user experience) designer at a top design firm. Thus, you are very familiar with both the experience of reading academic papers in academia and also designing amazing user experiences in digital products such as web applications. I am a professor who is creating a Chrome Extension for fun in order to prototype the following idea: I want to make the experience of reading academic papers more immersive by automatically playing Taylor Swift songs from the time period when each paper was written while the reader is reading that particular paper in Chrome. I have already set up all the code to connect to the Spotify Web API to programmatically play Taylor Swift songs from certain time periods. I have also already set up a basic Chrome Extension that knows what webpages the user has open in each tab and, if it detects that a webpage may contain metadata about an academic paper then it parses that webpage to get the year the paper was written in, in order to tell the extension what song to play from Spotify. That is the basic premise of my project.
Your job is to serve as a UX design consultant to help me design the user experience for such a Chrome Extension. Do not worry about whether it is feasible to implement the designs. I am an experienced programmer so I will tell you what ideas are or are not feasible to implement. I just want your help with thinking through UX design.
As our session progressed, I was very impressed with ChatGPT’s ability to help me brainstorm how to handle different user interaction scenarios. That said, I had to give it some guidance upfront using my knowledge of UX design: I started by asking it to come up with a few user personas and then to build up some user journeys for each. Given this initial prompting, ChatGPT was able to help me come up with practical ideas that I didn’t originally consider all too well, especially for handling unusual edge cases (e.g., what should happen to the music when the user switches between tabs very quickly?). The back-and-forth conversational nature of our chat made me feel like I was talking to a real human UX design consultant.
I had a lot of fun working with ChatGPT to refine my initial high-level ideas into a detailed plan for how to handle specific user interactions within Swift Papers. The culmination of our consulting session was ChatGPT generating ASCII diagrams of user journeys through Swift Papers, which I could later refer to when implementing this logic in code. Here’s one example:
Reflecting back, this session was productive because I was familiar enough with UX design concepts to steer the conversation towards more depth. Out of curiosity, I started a new chat session with exactly the same UX consultant prompt as above but then played the part of a total novice instead of guiding it:
I don’t know anything about UX design. Can you help me get started since you are the expert?
The conversation that followed this prompt was far less useful since ChatGPT ended up giving me a basic primer on UX Design 101 and offering high-level suggestions for how I can start thinking about the user experience of Swift Papers. I didn’t want to nudge it too hard since I was pretending to be a novice, and it wasn’t proactive enough to ask me clarifying questions to probe deeper. Perhaps if I had prompted it to be more proactive at the start, then it could have elicited more information even from a novice.
This digression reinforces the widely-known consensus that what you get out of LLMs like ChatGPT is only as good as the prompts you’re able to put in. There’s all of this relevant knowledge hiding inside its neural network mastermind of billions and billions of LLM parameters, but it’s up to you to coax it into revealing what it knows by taking the lead in conversations and crafting the right prompts to direct it toward useful responses. Doing so requires a degree of expertise in the domain you’re asking about, so it’s something that beginners would likely struggle with.
The Last Big Hurdle: Working with the Spotify APIAfter ChatGPT helped me with UX design, the last hurdle I had to overcome was figuring out how to connect my Chrome extension to the Spotify Web API to select and play music. Like my earlier adventure with installing a date parsing library, connecting to web APIs is another common real-world programming task, so I wanted to see how well ChatGPT could help me with it.
The gold standard here is an expert human programmer who has a lot of experience with the Spotify API and who is good at teaching novices. ChatGPT was alright for getting me started but ultimately didn’t meet this standard. My experience here showed me that human experts still outperform the current version of ChatGPT along the following dimensions:
In the end I got this Spotify API setup working by doing some old-fashioned web searching to supplement my ChatGPT conversation. (I did try the ChatGPT + Bing web search plugin for a bit, but it was slow and didn’t produce useful results, so I couldn’t tolerate it any more and just shut it off.) The breakthrough came as I was browsing a GitHub repository of Spotify Web API example code. I saw an example for Node.js that seemed to do what I wanted, so I copy-pasted that code snippet into ChatGPT and told it to adapt the example for my Swift Papers app (which isn’t using Node.js):
Here’s some example code using Implicit Grant Flow from Spotify’s documentation, which is for a Node.js app. Can you adapt it to fit my chrome extension? [I pasted the code snippet here]
ChatGPT did a good job at “translating” that example into my context, which was exactly what I needed at the moment to get unstuck. The code it generated wasn’t perfect, but it was enough to start me down a promising path that would eventually lead me to get the Spotify API working for Swift Papers. Reflecting back, I later realized that I had manually done a simple form of RAG (Retrieval Augmented Generation) here by using my intuition to retrieve a small but highly-relevant snippet of example code from the vast universe of all code on the internet and then asking a super-specific question about it. (However, I’m not sure a beginner would be able to scour the web to find such a relevant piece of example code like I did, so they would probably still be stuck at this step because ChatGPT alone wasn’t able to generate working code without this extra push from me.)
Epilogue: What Now?I have a confession: I didn’t end up finishing Swift Papers. Since this was a hobby project, I stopped working on it after about two weeks when my day-job got more busy. However, I still felt like I completed the initial hard parts and got a sense of how ChatGPT could (and couldn’t) help me along the way. To recap, this involved:
After laying this groundwork, I was able to start getting into the flow of an edit-run-debug cycle where I knew exactly where to add code to implement a new feature, how to run it to assess whether it did what I intended, and how to debug. So even though I stopped working on this project due to lack of time, I got far enough to see how completing Swift Papers would be “just a matter of programming.” Note that I’m not trying to trivialize the challenges involved in programming, since I’ve done enough of it to know that the devil is in the details. But these coding-specific details are exactly where AI tools like ChatGPT and GitHub Copilot shine! So even if I had continued adding features throughout the coming weeks, I don’t feel like I would’ve gotten any insights about AI tools that differ from what many others have already written about. That’s because once the software environment has been set up (e.g., libraries, frameworks, build systems, permissions, API authentication keys, and other plumbing to hook things together), then the task at hand reduces to a self-contained and well-defined programming problem, which AI tools excel at.
In sum, my goal in writing this article was to share my experiences using ChatGPT for the more open-ended tasks that came before my project turned into “just a matter of programming.” Now, some may argue that this isn’t “real” programming since it feels like just a bunch of mundane setup and configuration work. But I believe that if “real-world” programming means creating something realistic with code, then “real-real-world” programming (the title of this article!) encompasses all these tedious and idiosyncratic errands that are necessary before any real programming can begin. And from what I’ve experienced so far, this sort of work isn’t something humans can fully outsource to AI tools yet. Long story short, someone today can’t just give AI a high-level description of Swift Papers and have a robust piece of software magically pop out the other end. I’m sure people are now working on the next generation of AI that can bring us closer to this goal (e.g., much longer context windows with Claude 2 and retrieval augmented generation with Cody), so I’m excited to see what’s in store. Perhaps future AI tool developers could use Swift Papers as a benchmark to assess how well their tool performs on an example real-real-world programming task. Right now, widely-used benchmarks for AI code generation (e.g., HumanEval, MBPP) consist of small self-contained tasks that appear in introductory classes, coding interviews, or programming competitions. We need more end-to-end, real-world benchmarks to drive improvements in these AI tools.
Lastly, switching gears a bit, I also want to think more in the future about how AI tools can teach novices the skills they need to create realistic software projects like Swift Papers rather than doing all the implementation work for them. At present, ChatGPT and Copilot are reasonably good “doers” but not nearly as good at being teachers. This is unsurprising since they were designed to carry out instructions like a good assistant would, not to be an effective teacher who provides pedagogically-meaningful guidance. With the proper prompting and fine-tuning, I’m sure they can do much better here, and organizations like Khan Academy are already customizing GPT-4 to become a personalized tutor. I’m excited to see how things progress in this fast-moving space in the coming months and years. In the meantime, for more thoughts about AI coding tools in education, check out this other recent Radar article that I co-authored, Teaching Programming in the Age of ChatGPT, which summarizes our research paper about this topic.