Radar Trends to Watch: November 2023

O'Reilly Radar - Tue, 2023/11/07 - 03:58

Our Security section has grown almost as large as AI (and longer than Programming)—and that’s not including some security issues specific to AI, like model leeching. Does that mean that AI is cooling down? Or that security is heating up? It’s really impossible for security issues to get too much attention. The biggest news in AI arrived on the last day of October, and it wasn’t technical at all: the Biden administration’s executive order on AI. It will take some time to digest this, and even longer to see whether vendors follow the order’s recommendations. In itself, it’s evidence of an important ongoing trend: in the next year, many of the most important developments in AI will be legal rather than technical.

Artificial Intelligence
  • In an executive order, the US has issued a set of rules covering the development of advanced AI systems. The regulations encourage the development of watermarks (specifically the C2PA initiative) to authenticate communication; they attempt to set standards for testing; and they call for agencies to develop rules to protect consumers and workers.
  • Nightshade is another tool that artists can use to prevent generative AI systems from using their work. It makes unnoticeable modifications to the image that cause the AI model to misinterpret it and create incorrect output.
  • Stanford’s Institute for Human-Centered Artificial Intelligence has issued a report on transparency for large language models: whether the creators of LLMs are disclosing essential data about their models. No model scores well, and transparency appears to be declining as the field grows more competitive.
  • Chatbots perpetuate false and racially biased information in medical care. Debunked ideas about pain tolerance, kidney function, and other factors are included in training data, causing models to repeat those ideas.
  • An AI Bill of Materials (AIBOM) would document all of the materials that go into the creation of an AI system. This documentation would be essential to building AI that is capable of complying with regulation.
  • GPT-4 does Stephenson: GPT simulates the Young Lady’s Illustrated Primer (from The Diamond Age). With illustrations from DALL-E.
  • Step-Back Prompting is another prompting technique in which you ask a question, but before getting an answer, you ask the LLM to provide background information that will help it answer the question.
  • Prompt injection just got scarier. GPT-4V, which allows users to include images in conversations, is vulnerable to prompt injection through the images themselves; text in the images can be interpreted as prompts. Malicious prompts can even be hidden in images.
  • Google joins Microsoft, Adobe, and others in indemnifying users of their AI against copyright lawsuits.
  • Model leeching is a new attack against large language models. In model leeching, a carefully constructed set of prompts allows attackers to generate a smaller model that behaves similarly. The smaller model can then be used to construct other attacks against the original model.
  • Open source language models are proliferating. Replit Code v1.5 3B is now available on Hugging Face. This model is designed for code completion, and has been trained on permissively licensed code so there should be minimal legal issues.
  • Anthropic appears to have made significant progress in making large language models interpretable. The key is understanding the behavior of groups of neurons, which they call “features,” rather than individual neurons.
  • Mistral 7B is an open source large language model with impressive performance. It was developed independently. (It is not related to LLaMA.) Its performance is claimed to be better than equivalently sized models.
  • AMD may be able to challenge NVIDIA’s dominance of the GPU market. NVIDIA’s dominance relies on the widely used CUDA language for programming GPUs. AMD has developed a version of PyTorch that has been tuned for use on AMD GPUs, eliminating the need for low-level GPU programming.
  • Larger training datasets lead to more biased and hateful output, not less.
  • LangStream (unrelated to LangChain) is an open source platform for building streaming applications that use generative AI.
  • GPT-4 and Claude have proven useful in translating 16th century demonology texts written in Medieval Latin. Claude’s 100K context window appears to be a big help. (And Medieval Latin is much different from the Latin you probably didn’t learn in school.)
  • A vulnerability called ShellTorch allows attackers to gain access to AI servers using TorchServe, a tool for deploying and scaling AI models using PyTorch.
  • Reservoir computing is another kind of neural network that has promise for understanding chaotic systems.
  • Perhaps not surprisingly, language models can do an excellent job of lossless compression, outperforming standards like FLAC. (This doesn’t mean that language models store a compressed copy of the web.)
  • An artist makes the case that training generative models not to “hallucinate” has made them less interesting and less useful for creative applications.
  • Can you melt eggs? Quora has included a feature that generates answers using an older GPT model. This model answered “yes,” and aggressive SEO managed to get that “yes” to the top of a Google search.
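The step-back pattern mentioned above can be sketched as a two-call chain. Here `call_llm` is a hypothetical stand-in for whatever LLM API you use; the stub simply echoes its prompt so the sketch runs offline:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; returns a canned
    # response so the sketch runs without network access.
    return f"[model response to: {prompt[:40]}...]"

def step_back_answer(question: str) -> str:
    # Step 1: ask a more general "step-back" question to surface the
    # background knowledge relevant to the original question.
    background = call_llm(
        f"What general background knowledge is needed to answer: {question}"
    )
    # Step 2: answer the original question, grounded in that background.
    return call_llm(
        f"Background:\n{background}\n\nUsing the background above, answer: {question}"
    )

print(step_back_answer("Why does increasing temperature reduce a gas's density?"))
```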
Programming
  • Harpoon is a no-code, drag-and-drop tool for Kubernetes deployment.
  • Cackle is a new tool for the Rust tool chain. It checks access control lists and is used to make software supply chain attacks more difficult.
  • Correctness SLOs (Service-Level Objectives) are a way to specify the statistical properties of a program’s output if it is running properly. They could become important as AI is integrated into more applications.
  • Cilium is a tool for cloud native network observability. It provides a layer on top of eBPF that solves security and observability problems for Docker and Kubernetes workloads.
  • The Six Pillars of Platform Engineering is a great start for any organization that is serious about developer experience. The pillars are Security, Pipelines, Provisioning, Connectivity, Orchestration, and Observability. One article in this series is devoted to each.
  • Adam Jacob, creator of Chef Software, is out to reimagine DevOps. System Initiative is an open source tool for managing infrastructure that stresses collaboration between engineers and operations staff—something that was always the goal of DevOps but rarely achieved.
  • Unreal engine, a game development platform that had been free for users outside of the gaming industry, will now have a subscription fee. It will remain free for students and educators.
  • CRDTs (conflict-free replicated data types) are a data structure that is designed for resolving concurrent changes in collaborative applications (like Google Docs). Here’s a good interactive tutorial and a project: building a collaborative pixel editor.
  • Ambient is a purely web-based platform for multiplayer games, built with Wasm, WebGPU, and Rust. Instant deployment, no servers.
  • Google has open sourced its graph mining library. Graphs are becoming increasingly important in data mining and machine learning.
  • Microsoft has released a binary build of OpenJDK 21, presumably optimized for Azure. Shades of Embrace and Extend? That doesn’t appear to be happening.
  • Polystores can store many different kinds of data—relational data, vector data, unstructured data, graph data—in a single data management system.
Security
  • The EFF has posted an excellent introduction to passkeys, which are the next step past passwords in user authentication.
  • Microsoft has started an early access program for Security Copilot, a chatbot based on GPT-4 that has been tuned to answer questions about computer security. It can also summarize data from security incidents, analyze data from new attacks, and suggest responses.
  • Google is planning to test IP protection in Chrome. IP protection hides users’ IP addresses by routing traffic to or from specific domains through proxies. Address hiding defends against a number of common abuses, including cross-site tracking.
  • While the European Cyber Resilience Act (CRA) has many good ideas about making software more secure, it puts liability for software flaws on open source developers and companies funding open source development.
  • A new attack against memory, called RowPress, can cause bitflips even in DDR4 memory, which already incorporates protections against the RowHammer attack.
  • August and September’s distributed denial-of-service (DDoS) attacks against Cloudflare and Google took advantage of a newly discovered vulnerability in HTTP/2. Attackers open many streams per request, creating extremely high utilization with relatively few connections.
  • Mandiant has provided a fascinating analysis of the Russian military intelligence’s (GRU’s) playbook in Ukraine.
  • Mozilla and Fastly are developing OHTTP (Oblivious HTTP), a protocol built on HTTP that has been designed for privacy. OHTTP separates information about the requester from the request itself, so no single party ever has both pieces of information.
  • A newly discovered backdoor to WordPress allows attackers to take over websites. The malware is disguised as a WordPress plug-in that appears legitimate.
  • While standards are still developing, decentralized identity and verifiable credentials are starting to appear outside of the cryptocurrency world. When adopted, these technologies will significantly enhance both privacy and security.
  • To improve its ability to detect unwanted and harmful email, Gmail will be requiring bulk email senders (over 5,000 messages per day) to implement SPF, DKIM, and DMARC authentication records in DNS or risk having their messages marked as spam.
  • Genetic data has been stolen from 23andMe. The attack was quite simple: the attackers just used usernames and passwords that were in circulation and had been reused.
  • The time required to execute a ransomware attack has dropped from 10 days to 2 days, and it’s increasingly common for victims to be hit with a second attack against systems that have already been compromised.
  • Toxiproxy is a tool for chaos network engineering. It is a proxy server that simulates many kinds of network misbehavior.
  • Network neutrality rises again: The chair of the FCC has proposed returning to Obama-era network neutrality rules, in which carriers couldn’t prioritize traffic from some users in exchange for payment. Laws in some states, such as California, have largely prevented traffic prioritization, but a return of network neutrality would provide a uniform regulatory framework.
  • Most VPNs (even VPNs that don’t log traffic) track user activity. Obscura is a new VPN that was designed for privacy, and that cannot track activity.
Biology
  • The US Fish & Wildlife Service is creating a biodiversity library. The library’s goal is to preserve tissue samples from all endangered species in the US. The animals’ DNA will be sequenced and uploaded to GenBank, a collection of all publicly available DNA sequences.
Quantum Computing
  • Atom Computing claims to have built a 1,000 qubit quantum computer. While this is still too small to do real work, it’s the largest quantum computer we know about; it looks like it can scale to (somewhat) larger sizes; and it doesn’t require extreme cold.
  • Two research teams have made progress in quantum error correction. Lately, we’ve seen several groups reporting progress in QEC, which is key to making quantum computing practical. Will this soon be a solved problem?
Robotics
  • This article’s title is all you need: Boston Dynamics turned its robotic dog into a walking tour guide using ChatGPT. It can give a tour of Boston Dynamics’ facilities in which it answers questions, using data from its cameras to provide added context. And it has a British accent.
  • Another autonomous robotic dog can plan and execute actions in complex environments. While its agility is impressive, what sets it apart is the ability to plan actions to achieve a goal, taking into account the objects that it sees.
  • A tetrahedral robot is able to change its shape and size, use several different styles of walking, and adapt itself to different tasks.
Categories: Technology

Questions for 2024

O'Reilly Radar - Tue, 2023/10/31 - 03:11

This time of year, everyone publishes predictions. They’re fun, but I don’t find them a good source of insight into what’s happening in technology.

Instead of predictions, I’d prefer to look at questions: What are the questions to which I’d like answers as 2023 draws to a close? What are the unknowns that will shape 2024? That’s what I’d really like to know. Yes, I could flip a coin or two and turn these into predictions, but I’d rather leave them open-ended. Questions don’t give us the security of an answer. They force us to think, and to continue thinking. And they let us pose problems that we really can’t think about if we limit ourselves to predictions like “While individual users are getting bored with ChatGPT, enterprise use of Generative AI will continue to grow.” (Which, as predictions go, is pretty good.)

The Lawyers Are Coming

The year of tech regulation: Outside of the EU, we may be underwhelmed by the amount of proposed regulation that becomes law. However, discussion of regulation will be a major pastime of the chattering classes, and major technology companies (and venture capital firms) will be maneuvering to ensure that regulation benefits them. Regulation is a double-edged sword: while it may limit what you can do, if compliance is difficult, it gives established companies an advantage over smaller competition.

Three specific areas need watching:

  • What regulations will be proposed for AI? Many ideas are in the air; watch for changes in copyright law, privacy, and harmful use.
  • What regulations will be proposed for “online safety”? Many of the proposals we’ve seen are little more than hidden attacks against cryptographically secure communications.
  • Will we see more countries and states develop privacy regulations? The EU has led with GDPR. However, effective privacy regulation comes into direct conflict with online safety, as those ideas are often formulated. Which will win out?

Organized labor: Unions are back. How will this affect technology? I doubt that we’ll see strikes at major technology companies like Google and Amazon—but we’ve already seen a union at Bandcamp. Could this become a trend? X (Twitter) employees have plenty to be unhappy about, though many of them have immigration complications that would make unionization difficult.

The backlash against the backlash against open source: Over the past decade, a number of corporate software projects have changed from an open source license, such as Apache, to one of a number of “business source” licenses. These licenses vary, but typically restrict users from competing with the project’s vendor. When HashiCorp relicensed their widely used Terraform product as business source, their community’s reaction was strong and immediate. The community formed an OpenTF consortium and forked the last open source version of Terraform, renaming it OpenTofu; OpenTofu was quickly adopted under the Linux Foundation’s mantle and appears to have significant traction among developers. In response, HashiCorp’s CEO has predicted that the rejection of business source licenses will be the end of open source.

  • As more corporate sponsors adopt business source licenses, will we see more forks?
  • Will OpenTofu survive in competition with Terraform?

A decade ago, we said that open source has won. More recently, developers questioned open source’s relevance in an era of web giants. In 2023, the struggle resumed. By the end of 2024, we’ll know a lot more about the answers to these questions.

Simpler, Please

Kubernetes: Everyone (well, almost everyone) is using Kubernetes to orchestrate large applications that are running in the cloud. And everyone (well, almost everyone) thinks Kubernetes is too complex. That’s no doubt true; Kubernetes descends from Google’s Borg, the almost legendary software that ran Google’s core applications. Kubernetes was designed for Google-scale deployments, but very few organizations need that.

We’ve long thought that a simpler alternative to Kubernetes would arrive. We haven’t seen it. We have seen some simplifications built on top of Kubernetes: K3s is one; Harpoon is a no-code drag-and-drop tool for managing Kubernetes. And all the major cloud providers offer “managed Kubernetes” services that take care of Kubernetes for you.

So our questions about container orchestration are:

  • Will we see a simpler alternative that succeeds in the marketplace? There are some alternatives out there now, but they haven’t gained traction.
  • Are simplification layers on top of Kubernetes enough? Simplification usually comes with limitations: users find most of what they want but frequently miss one feature they need.

From microservices to monolith: While microservices have dominated the discussion of software architecture, there have always been other voices arguing that microservices are too complex, and that monolithic applications are the way to go. Those voices are becoming more vocal. We’ve heard lots about organizations decomposing their monoliths to build collections of microservices—but in the past year we’ve heard more about organizations going the other way. So we need to ask:

  • Is this the year of the monolith?
  • Will the “modular monolith” gain traction?
  • When do companies need microservices?
Securing Your AI

AI systems are not secure: Large language models are vulnerable to new attacks like prompt injection, in which adversarial input directs the model to ignore its instructions and produce hostile output. Multimodal models share this vulnerability: it’s possible to submit an image with an invisible prompt to ChatGPT and corrupt its behavior. There is no known solution to this problem; there may never be one.

With that in mind, we have to ask:

  • When will we see a major, successful hostile attack against generative AI? (I’d bet it will happen before the end of 2024. That’s a prediction. The clock is ticking.)
  • Will we see a solution to prompt injection, data poisoning, model leakage, and other attacks?
Not Dead Yet

The metaverse: It isn’t dead, but it’s not what Mark Zuckerberg or Tim Cook thought. We’ll discover that the metaverse isn’t about wearing goggles, and it certainly isn’t about walled-off gardens. It’s about better tools for collaboration and presence. While this isn’t a big trend, we’ve seen an upswing in developers working with CRDTs and other tools for decentralized, frictionless collaboration.

NFTs: NFTs are a solution looking for a problem. Enabling people with money to prove they can spend their money on bad art wasn’t a problem many people wanted to solve. But there are problems out there that they could solve, such as maintaining public records in an open immutable database. Will NFTs actually be used to solve any of these problems?


Preliminary Thoughts on the White House Executive Order on AI

O'Reilly Radar - Mon, 2023/10/30 - 13:36

Disclaimer: Based on the announcement of the EO, without having seen the full text.

Overall, the Executive Order is a great piece of work, displaying a great deal of both expertise and thoughtfulness. It balances optimism about the potential of AI with reasonable consideration of the risks. And it doesn’t rush headlong into new regulations or the creation of new agencies, but instead directs existing agencies and organizations to understand and apply AI to their mission and areas of oversight. The EO also does an impressive job of highlighting the need to bring more AI talent into government. That’s a huge win.

Given my own research focus on enhanced disclosures as the starting point for better AI regulation, I was heartened to hear that the Executive Order on AI uses the Defense Production Act to compel disclosure of various data from the development of large AI models. Unfortunately, these disclosures do not go far enough. The EO seems to be requiring only data on the procedures and results of “Red Teaming” (i.e. adversarial testing to determine a model’s flaws and weak points), and not a wider range of information that would help to address many of the other concerns outlined in the EO. These include:

  • What data sources the model is trained on. Availability of this information would assist in many of the other goals outlined in the EO, including addressing algorithmic discrimination and increasing competition in the AI market, as well as other important issues that the EO does not address, such as copyright. The recent discovery (documented by an exposé in The Atlantic) that OpenAI, Meta, and others used databases of pirated books, for example, highlights the need for transparency in training data. Given the importance of intellectual property to the modern economy, copyright ought to be an important part of this executive order. Transparency on this issue will not only allow for debate and discussion of the intellectual property issues raised by AI, it will increase competition between developers of AI models to license high-quality data sources and to differentiate their models based on that quality. To take one example, would we be better off with the medical or legal advice from an AI that was trained only with the hodgepodge of knowledge to be found on the internet, or one trained on the full body of professional information on the topic?
  • Operational Metrics. Like other internet-available services, AI models are not static artifacts, but dynamic systems that interact with their users. AI companies deploying these models manage and control them by measuring and responding to various factors, such as permitted, restricted, and forbidden uses; restricted and forbidden users; methods by which its policies are enforced; detection of machine-generated content, prompt-injection, and other cyber-security risks; usage by geography, and if measured, by demographics and psychographics; new risks and vulnerabilities identified during operation that go beyond those detected in the training phase; and much more. These should not be a random grab-bag of measures thought up by outside regulators or advocates, but disclosures of the actual measurements and methods that the companies use to manage their AI systems.
  • Policy on use of user data for further training. AI companies typically treat input from their users as additional data available for training. This has both privacy and intellectual property implications.
  • Procedures by which the AI provider will respond to user feedback and complaints. This should include its proposed redress mechanisms.
  • Methods by which the AI provider manages and mitigates risks identified via Red Teaming, including their effectiveness. This reporting should not just be “once and done,” but an ongoing process that allows the researchers, regulators, and the public to understand whether the models are improving or declining in their ability to manage the identified new risks.
  • Energy usage and other environmental impacts. There has been a lot of fear-mongering about the energy costs of AI and its potential impact in a warming world. Disclosure of the actual amount of energy used for training and operating AI models would allow for a much more reasoned discussion of the issue.

These are only a few off-the-cuff suggestions. Ideally, once a full range of required disclosures has been identified, they should be overseen by either an existing governmental standards body, or a non-profit akin to the Financial Accounting Standards Board (FASB) that oversees accounting standards. This is a rapidly-evolving field, and so disclosure is not going to be a “once-and-done” kind of activity. We are still in the early stages of the AI era, and innovation should be allowed to flourish. But this places an even greater emphasis on the need for transparency, and the establishment of baseline reporting frameworks that will allow regulators, investors, and the public to measure how successfully AI developers are managing the risks, and whether AI systems are getting better or worse over time.


After reading the details found in the full Executive Order on AI, rather than just the White House summary, I am far less positive about the impact of this order, and what appeared to be the first steps towards a robust disclosure regime, which is a necessary precursor to effective regulation. The EO will have no impact on the operations of current AI services like ChatGPT, Bard, and others under current development, since its requirements that model developers disclose the results of their “red teaming” of model behaviors and risks only apply to future models trained with orders of magnitude more compute power than any current model. In short, the AI companies have convinced the Biden Administration that the only risks worth regulating are the science-fiction existential risks of far future AI rather than the clear and present risks in current models.

It is true that various agencies have been tasked with considering present risks such as discrimination in hiring, criminal justice applications, and housing, as well as impacts on the job market, healthcare, education, and competition in the AI market, but those efforts are in their infancy and years off. The most important effects of the EO, in the end, turn out to be the call to increase hiring of AI talent into those agencies, and to increase their capabilities to deal with the issues raised by AI. Those effects may be quite significant over the long run, but they will have little short-term impact.

In short, the big AI companies have hit a home run in heading off any effective regulation for some years to come.


Model Collapse: An Experiment

O'Reilly Radar - Tue, 2023/10/24 - 03:07

Ever since the current craze for AI-generated everything took hold, I’ve wondered: what will happen when the world is so full of AI-generated stuff (text, software, pictures, music) that our training sets for AI are dominated by content created by AI? We already see hints of that on GitHub: in February 2023, GitHub said that 46% of all the code checked in was written by Copilot. That’s good for the business, but what does that mean for future generations of Copilot? At some point in the near future, new models will be trained on code that they have written. The same is true for every other generative AI application: DALL-E 4 will be trained on data that includes images generated by DALL-E 3, Stable Diffusion, Midjourney, and others; GPT-5 will be trained on a set of texts that includes text generated by GPT-4; and so on. This is unavoidable. What does this mean for the quality of the output they generate? Will that quality improve, or will it suffer?

I’m not the only person wondering about this. At least one research group has experimented with training a generative model on content generated by generative AI, and has found that the output, over successive generations, was more tightly constrained, and less likely to be original or unique. Generative AI output became more like itself over time, with less variation. They reported their results in “The Curse of Recursion,” a paper that’s well worth reading. (Andrew Ng’s newsletter has an excellent summary of this result.)

I don’t have the resources to recursively train large models, but I thought of a simple experiment that might be analogous. What would happen if you took a list of numbers, computed their mean and standard deviation, used those to generate a new list, and did that repeatedly? This experiment only requires simple statistics—no AI.

Although it doesn’t use AI, this experiment might still demonstrate how a model could collapse when trained on data it produced. In many respects, a generative model is a correlation engine. Given a prompt, it generates the word most likely to come next, then the word most likely to come after that, and so on. If the words “To be” pop out, the next word is reasonably likely to be “or”; the next word after that is even more likely to be “not”; and so on. The model’s predictions are, more or less, correlations: what word is most strongly correlated with what came before? If we train a new AI on its output, and repeat the process, what is the result? Do we end up with more variation, or less?
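A toy version of that feedback loop uses simple bigram statistics rather than an actual language model: count which word follows which, sample new “text” from those counts, retrain on the sample, and repeat. Because sampling can only reproduce transitions that already exist, the variety of word pairs tends to shrink from generation to generation. A minimal sketch (not the paper’s method):

```python
import random
from collections import defaultdict

def train(words):
    # Count, for each word, the words that follow it.  The text is treated
    # as circular, so every word has at least one successor (no dead ends).
    counts = defaultdict(list)
    for a, b in zip(words, words[1:] + words[:1]):
        counts[a].append(b)
    return counts

def sample(counts, length, rng):
    # Generate new "text" by walking the chain: each step picks a successor
    # of the current word with probability proportional to its count.
    word = rng.choice(list(counts))
    out = [word]
    for _ in range(length - 1):
        word = rng.choice(counts[word])
        out.append(word)
    return out

def distinct_bigrams(words):
    return len(set(zip(words, words[1:] + words[:1])))

rng = random.Random(42)
corpus = ("to be or not to be that is the question "
          "whether tis nobler in the mind to suffer "
          "the slings and arrows of outrageous fortune "
          "or to take arms against a sea of troubles").split()
start = distinct_bigrams(corpus)
for generation in range(30):
    corpus = sample(train(corpus), 100, rng)

print(start, "->", distinct_bigrams(corpus))  # variety shrinks over generations
```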

To answer these questions, I wrote a Python program that generated a long list of random numbers (1,000 elements) according to the Gaussian distribution with mean 0 and standard deviation 1. I took the mean and standard deviation of that list, and used those to generate another list of random numbers. I iterated 1,000 times, then recorded the final mean and standard deviation. This result was suggestive: the standard deviation of the final list was almost always much smaller than the initial value of 1. But it varied widely, so I decided to perform the experiment (1,000 iterations) 1,000 times, and average the final standard deviation from each experiment. (1,000 experiments is overkill; 100 or even 10 will show similar results.)
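The procedure above fits in a few lines. This is a sketch of the experiment as described, not the author’s actual script; it assumes NumPy and its default (population) standard deviation:

```python
import numpy as np

def iterate(n=1_000, iterations=1_000, rng=None):
    # Start from a Gaussian with mean 0 and standard deviation 1, then
    # repeatedly refit: measure the list's mean and standard deviation,
    # and draw a fresh list from those measured parameters.
    rng = rng or np.random.default_rng()
    mean, sd = 0.0, 1.0
    for _ in range(iterations):
        data = rng.normal(mean, sd, n)
        mean, sd = data.mean(), data.std()
    return sd

rng = np.random.default_rng(0)
# Average the final standard deviation over many repeated experiments.
finals = [iterate(rng=rng) for _ in range(100)]
print(sum(finals) / len(finals))  # collapses well below the starting value of 1
```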

When I did this, the standard deviation of the list gravitated (I won’t say “converged”) to roughly 0.45; although it still varied, it was almost always between 0.4 and 0.5. (I also computed the standard deviation of the standard deviations, though this wasn’t as interesting or suggestive.) This result was remarkable; my intuition told me that the standard deviation wouldn’t collapse. I expected it to stay close to 1, and the experiment would serve no purpose other than exercising my laptop’s fan. But with this initial result in hand, I couldn’t help going further. I increased the number of iterations again and again. As the number of iterations increased, the standard deviation of the final list got smaller and smaller, dropping to 0.0004 at 10,000 iterations.

I think I know why. (It’s very likely that a real statistician would look at this problem and say “It’s an obvious consequence of the law of large numbers.”) If you look at the standard deviations one iteration at a time, there’s a lot of variance. We generate the first list with a standard deviation of 1, but when computing the standard deviation of that data, we’re likely to get a standard deviation of 1.1 or 0.9 or almost anything else. When you repeat the process many times, standard deviations below 1 come to dominate, even though they aren’t more likely. They shrink the “tail” of the distribution: when you generate a list of numbers with a standard deviation of 0.9, you’re much less likely to get a list with a standard deviation of 1.1, and more likely to get a standard deviation of 0.8. Once the tail of the distribution starts to disappear, it’s very unlikely to grow back.
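That intuition can be checked directly: the sample standard deviation of Gaussian data is, on average, slightly below the true value, so each refit multiplies the scale by a factor whose logarithm has a slightly negative mean. The log of the scale therefore performs a random walk with a persistent downward drift. A quick check (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
# Sample standard deviations of many lists of 1,000 standard normals.
sds = np.array([rng.normal(0.0, 1.0, 1_000).std() for _ in range(20_000)])

print(sds.mean())          # just under 1: the estimate is biased slightly low
print(np.log(sds).mean())  # negative: each refit shrinks the scale on average
```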

What does this mean, if anything?

My experiment shows that if you feed the output of a random process back into its input, standard deviation collapses. This is exactly what the authors of “The Curse of Recursion” described when working directly with generative AI: “the tails of the distribution disappeared,” almost completely. My experiment provides a simplified way of thinking about collapse, and demonstrates that model collapse is something we should expect.

Model collapse presents AI development with a serious problem. On the surface, preventing it is easy: just exclude AI-generated data from training sets. But that’s not possible, at least for now, because tools for detecting AI-generated content have proven inaccurate. Watermarking might help, although watermarking brings its own set of problems, including whether developers of generative AI will implement it. Difficult as eliminating AI-generated content might be, collecting human-generated content could become an equally significant problem. If AI-generated content displaces human-generated content, quality human-generated content could be hard to find.

If that’s so, then the future of generative AI may be bleak. As the training data becomes ever more dominated by AI-generated output, its ability to surprise and delight will diminish. It will become predictable, dull, boring, and probably no less likely to “hallucinate” than it is now. To be unpredictable, interesting, and creative, we still need ourselves.

Categories: Technology

Prompting Isn’t The Most Important Skill

O'Reilly Radar - Tue, 2023/10/17 - 03:21

Anant Agarwal, an MIT professor and one of the founders of the EdX educational platform, recently created a stir by saying that prompt engineering was the most important skill you could learn. And that you could learn the basics in two hours.

Although I agree that designing good prompts for AI is an important skill, Agarwal overstates his case. But before discussing why, it’s important to think about what prompt engineering means.

Attempts to define prompt engineering fall into two categories:

  • Coming up with clever prompts to get an AI to do what you want while sitting at your laptop. This definition is essentially interactive. It’s arguable whether this should be called “engineering”; at this point, it’s more of an art than an applied science. This is probably the definition that Agarwal has in mind.
  • Designing and writing software systems that generate prompts automatically. This definition isn’t interactive; it’s automating a task to make it easier for others to do. This work increasingly falls under the rubric of RAG (Retrieval Augmented Generation), in which a program takes a request, looks up data relevant to that request, and packages everything in a complex prompt.

Designing automated prompting systems is clearly important. It gives you much more control over what an AI is likely to do; if you package the information needed to answer a question into the prompt, and tell the AI to limit its response to information included in that package, it’s much less likely to “hallucinate.” But that’s a programming task that isn’t going to be learned in a couple of hours; it typically involves generating embeddings, using a vector database, then generating a chain of prompts that are answered by different systems, combining the answers, and possibly generating more prompts.  Could the basics be learned in a couple of hours? Perhaps, if the learner is already an expert programmer, but that’s ambitious—and may require a definition of “basic” that sets a very low bar.
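The core of that pattern can be sketched in a few lines. This toy version (the function names are mine, not from any framework) ranks documents by word overlap with the request; a production system would use embeddings and a vector database, as described above.

```python
def retrieve(query, documents, k=2):
    """Toy retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    """Package retrieved context into a prompt that tells the model to
    answer only from the supplied material."""
    context = "\n".join("- " + d for d in retrieve(query, documents))
    return ("Answer the question using ONLY the context below. "
            "If the context is insufficient, say you don't know.\n\n"
            "Context:\n" + context + "\n\nQuestion: " + query)

docs = ["The warranty covers parts for two years.",
        "Shipping takes five business days.",
        "Returns are accepted within 30 days."]
print(build_prompt("How long does the warranty cover parts?", docs))
```

The instruction to answer only from the packaged context is what makes the model less likely to “hallucinate.”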

What about the first, interactive definition? It’s worth noting that all prompts are not created equal. Prompts for ChatGPT are essentially free-form text. Free-form text sounds simple, and it is simple at first. However, more detailed prompts can look like essays, and when you take them apart, you realize that they are essentially computer programs. They tell the computer what to do, even though they aren’t written in a formal computer language. Prompts for an image generation AI like Midjourney can include sections that are written in an almost-formal metalanguage that specifies requirements like resolution, aspect ratio, styles, coordinates, and more. It’s not programming as such, but creating a prompt that produces professional-quality output is much more like programming than “a tarsier fighting with a python.”

So, the first thing anyone needs to learn about prompting is that writing really good prompts is more difficult than it seems. Your first experience with ChatGPT is likely to be “Wow, this is amazing,” but unless you get better at telling the AI precisely what you want, your 20th experience is more likely to be “Wow, this is dull.”

Second, I wouldn’t debate the claim that anyone can learn the basics of writing good prompts in a couple of hours. Chain of thought (in which the prompt includes some examples showing how to solve a problem) isn’t difficult to grasp. Neither is including evidence for the AI to use as part of the prompt. Neither are many of the other patterns that create effective prompts. There’s surprisingly little magic here. But it’s important to take a step back and think about what chain of thought requires: you need to tell the AI how to solve your problem, step by step, which means that you first need to know how to solve your problem. You need to have (or create) other examples that the AI can follow. And you need to decide whether the output the AI generates is correct. In short, you need to know a lot about the problem you’re asking the AI to solve.
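Assembling a prompt with worked examples is the mechanical part; a rough sketch follows (the format is my own, not a standard). Supplying correct step-by-step solutions, and judging the model's output, is where the expertise comes in.

```python
def chain_of_thought_prompt(examples, question):
    """Build a few-shot prompt in which every example shows its
    reasoning steps, not just the final answer."""
    parts = []
    for q, steps, answer in examples:
        parts.append("Q: " + q)
        parts.extend("Step: " + s for s in steps)
        parts.append("A: " + answer)
        parts.append("")
    parts.append("Q: " + question)
    parts.append("Think step by step, then give the answer.")
    return "\n".join(parts)

example = ("What is 15% of 80?",
           ["Convert 15% to 0.15.", "Multiply 0.15 by 80 to get 12."],
           "12")
print(chain_of_thought_prompt([example], "What is 20% of 45?"))
```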

That’s why many teachers, particularly in the humanities, are excited about generative AI. When used well, it’s engaging and it encourages students to learn more: learning the right questions to ask, doing the hard research to track down facts, thinking through the logic of the AI’s response carefully, deciding whether or not that response makes sense in its context. Students writing prompts for AI need to think carefully about the points they want to make, how they want to make them, and what supporting facts to use. I’ve made a similar argument about the use of AI in programming. AI tools won’t eliminate programming, but they’ll put more stress on higher-level activities: understanding user requirements, understanding software design, understanding the relationship between components of a much larger system, and strategizing about how to solve a problem. (To say nothing of debugging and testing.) If generative AI helps us put to rest the idea that programming is about antisocial people grinding out lines of code, and helps us to realize that it’s really about humans understanding problems and thinking about how to solve them, the programming profession will be in a better place.

I wouldn’t hesitate to advise anyone to spend two hours learning the basics of writing good prompts—or 4 or 8 hours, for that matter. But the real lesson here is that prompting isn’t the most important thing you can learn. To be really good at prompting, you need to develop expertise in what the prompt is about. You need to become more expert in what you’re already doing—whether that’s programming, art, or humanities. You need to be engaged with the subject matter, not the AI. The AI is only a tool: a very good tool that does things that were unimaginable only a few years ago, but still a tool. If you give in to the seduction of thinking that AI is a repository of expertise and wisdom that a human couldn’t possibly obtain, you’ll never be able to use AI productively.

I wrote a PhD dissertation on late 18th and early 19th century English literature. I didn’t get that degree so that a computer could know everything about English Romanticism for me. I got it because I wanted to know. “Wanting to know” is exactly what it will take to write good prompts. In the long run, the will to learn something yourself will be much more important than a couple of hours training in effective prompting patterns. Using AI as a shortcut so that you don’t have to learn is a big step on the road to irrelevance. The “will to learn” is what will keep you and your job relevant in an age of AI.

Categories: Technology

Automated Mentoring with ChatGPT

O'Reilly Radar - Tue, 2023/10/10 - 03:18

Ethan and Lilach Mollick’s paper Assigning AI: Seven Approaches for Students with Prompts explores seven ways to use AI in teaching. (While this paper is eminently readable, there is a non-academic version in Ethan Mollick’s Substack.) The article describes seven roles that an AI bot like ChatGPT might play in the education process: Mentor, Tutor, Coach, Student, Teammate, Simulator, and Tool. For each role, it includes a detailed example of a prompt that can be used to implement that role, along with an example of a ChatGPT session using the prompt, risks of using the prompt, guidelines for teachers, instructions for students, and instructions to help teachers build their own prompts.

The Mentor role is particularly important to the work we do at O’Reilly in training people in new technical skills. Programming (like any other skill) isn’t just about learning the syntax and semantics of a programming language; it’s about learning to solve problems effectively. That requires a mentor; Tim O’Reilly has always said that our books should be like “someone wise and experienced looking over your shoulder and making recommendations.” So I decided to give the Mentor prompt a try on some short programs I’ve written. Here’s what I learned–not particularly about programming, but about ChatGPT and automated mentoring. I won’t reproduce the session (it was quite long). And I’ll say this now, and again at the end: what ChatGPT can do right now has limitations, but it will certainly get better, and it will probably get better quickly.

First, Ruby and Prime Numbers

I first tried a Ruby program I wrote about 10 years ago: a simple prime number sieve. Perhaps I’m obsessed with primes, but I chose this program because it’s relatively short, and because I haven’t touched it for years, so I was somewhat unfamiliar with how it worked. I started by pasting in the complete prompt from the article (it is long), answering ChatGPT’s preliminary questions about what I wanted to accomplish and my background, and pasting in the Ruby script.

ChatGPT responded with some fairly basic advice about following common Ruby naming conventions and avoiding inline comments (Rubyists used to think that code should be self-documenting. Unfortunately). It also made a point about a puts() method call within the program’s main loop. That’s interesting–the puts() was there for debugging, and I evidently forgot to take it out. It also made a useful point about security: while a prime number sieve raises few security issues, reading command line arguments directly from ARGV rather than using a library for parsing options could leave the program open to attack.
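The article doesn't reproduce the Ruby script, but its two substantive points (the structure of a simple sieve, and parsing options with a library rather than reading raw arguments) translate directly. A rough Python analogue, mine rather than the original:

```python
import argparse

def primes_up_to(limit):
    """Simple Sieve of Eratosthenes."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, prime in enumerate(is_prime) if prime]

def parse_args(argv=None):
    """Parse options with argparse rather than reading raw arguments,
    the Python analogue of ChatGPT's advice about ARGV."""
    parser = argparse.ArgumentParser(description="Prime number sieve")
    parser.add_argument("limit", type=int, help="upper bound (inclusive)")
    return parser.parse_args(argv)

args = parse_args(["100"])
print(primes_up_to(args.limit))
```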

It also gave me a new version of the program with these changes made. Rewriting the program wasn’t appropriate: a mentor should comment and provide advice, but shouldn’t rewrite your work. That should be up to the learner. However, it isn’t a serious problem. Preventing this rewrite is as simple as just adding “Do not rewrite the program” to the prompt.

Second Try: Python and Data in Spreadsheets

My next experiment was with a short Python program that used the Pandas library to analyze survey data stored in an Excel spreadsheet. This program had a few problems–as we’ll see.

ChatGPT’s Python mentoring didn’t differ much from Ruby: it suggested some stylistic changes, such as using snake-case variable names, using f-strings (I don’t know why I didn’t; they’re one of my favorite features), encapsulating more of the program’s logic in functions, and adding some exception checking to catch possible errors in the Excel input file. It also objected to my use of “No Answer” to fill empty cells. (Pandas normally converts empty cells to NaN, “not a number,” and they’re frustratingly hard to deal with.) Useful feedback, though hardly earthshaking. It would be hard to argue against any of this advice, but at the same time, there’s nothing I would consider particularly insightful. If I were a student, I’d soon get frustrated after two or three programs yielded similar responses.

Of course, if my Python really was that good, maybe I only needed a few cursory comments about programming style–but my program wasn’t that good. So I decided to push ChatGPT a little harder. First, I told it that I suspected the program could be simplified by using the dataframe.groupby() function in the Pandas library. (I rarely use groupby(), for no good reason.) ChatGPT agreed–and while it’s nice to have a supercomputer agree with you, this is hardly a radical suggestion. It’s a suggestion I would have expected from a mentor who had used Python and Pandas to work with data. I had to make the suggestion myself.

ChatGPT obligingly rewrote the code–again, I probably should have told it not to. The resulting code looked reasonable, though it made a not-so-subtle change in the program’s behavior: it filtered out the “No answer” rows after computing percentages, rather than before. It’s important to watch out for minor changes like this when asking ChatGPT to help with programming. Such minor changes happen frequently; they look innocuous, but they can change the output. (A rigorous test suite would have helped.) This was an important lesson: you really can’t assume that anything ChatGPT does is correct. Even if it’s syntactically correct, even if it runs without error messages, ChatGPT can introduce changes that lead to errors. Testing has always been important (and under-utilized); with ChatGPT, it’s even more so.
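The filter-order bug is easy to demonstrate with made-up data (this is not the author's survey; the numbers are illustrative):

```python
import pandas as pd

# Illustrative stand-in for the survey spreadsheet (not the author's data)
df = pd.DataFrame({"response": ["Yes", "No", "No Answer", "Yes", "Yes"]})

# groupby() collapses per-response counting into a single expression
counts = df.groupby("response").size()
print(int(counts["Yes"]))  # 3

# Filtering "No Answer" BEFORE computing percentages (4-row denominator)...
kept = df[df["response"] != "No Answer"]
pct_before = 100 * kept["response"].value_counts(normalize=True)

# ...is not the same as filtering AFTER (5-row denominator)
pct_after = (100 * df["response"].value_counts(normalize=True)).drop("No Answer")

print(float(pct_before["Yes"]))  # 75.0
print(round(float(pct_after["Yes"]), 1))  # 60.0
```

Both versions run without errors and produce plausible-looking percentages, which is exactly why a test suite matters.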

Now for the next test. I accidentally omitted the final lines of my program, which made a number of graphs using Python’s matplotlib library. While this omission didn’t affect the data analysis (it printed the results on the terminal), several lines of code arranged the data in a way that was convenient for the graphing functions. These lines of code were now a kind of “dead code”: code that is executed, but that has no effect on the result. Again, I would have expected a human mentor to be all over this. I would have expected them to say “Look at the data structure graph_data. Where is that data used? If it isn’t used, why is it there?” I didn’t get that kind of help. A mentor who doesn’t point out problems in the code isn’t much of a mentor.
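Spotting assigned-but-unused names like graph_data is something a mentor, or even a very simple static check, can do mechanically. A crude sketch using Python's ast module (it handles only simple assignments, not attributes or function definitions):

```python
import ast

def unused_assignments(source):
    """Crude dead-code check: report names that are assigned but never
    read. Handles only simple name assignments."""
    assigned, loaded = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
    return assigned - loaded

print(unused_assignments("x = 1\ngraph_data = [x]\nprint(x)"))  # {'graph_data'}
```

Real linters (pyflakes, pylint) do this far more thoroughly, which makes it all the more striking that the AI mentor didn't.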

So my next prompt asked for suggestions about cleaning up the dead code. ChatGPT praised me for my insight and agreed that removing dead code was a good idea. But again, I don’t want a mentor to praise me for having good ideas; I want a mentor to notice what I should have noticed, but didn’t. I want a mentor to teach me to watch out for common programming errors, and to remember that source code inevitably degrades over time if you’re not careful–even as it’s improved and restructured.

ChatGPT also rewrote my program yet again. This final rewrite was incorrect–this version didn’t work. (It might have done better if I had been using Code Interpreter, though Code Interpreter is no guarantee of correctness.) That both is, and is not, an issue. It’s yet another reminder that, if correctness is a criterion, you have to check and test everything ChatGPT generates carefully. But–in the context of mentoring–I should have written a prompt that suppressed code generation; rewriting your program isn’t the mentor’s job. Furthermore, I don’t think it’s a terrible problem if a mentor occasionally gives you poor advice. We’re all human (at least, most of us). That’s part of the learning experience. And it’s important for us to find applications for AI where errors are tolerable.

So, what’s the score?

  • ChatGPT is good at giving basic advice. But anyone who’s serious about learning will soon want advice that goes beyond the basics.
  • ChatGPT can recognize when the user makes good suggestions that go beyond simple generalities, but is unable to make those suggestions itself. This happened twice: when I had to ask it about groupby(), and when I asked it about cleaning up the dead code.
  • Ideally, a mentor shouldn’t generate code. That can be fixed easily. However, if you want ChatGPT to generate code implementing its suggestions, you have to check carefully for errors, some of which may be subtle changes in the program’s behavior.

Not There Yet

Mentoring is an important application for language models, not least because it finesses one of their biggest problems: their tendency to make mistakes. A mentor that occasionally makes a bad suggestion isn’t really a problem; following the suggestion and discovering that it’s a dead end is an important learning experience in itself. You shouldn’t believe everything you hear, even if it comes from a reliable source. And a mentor really has no business generating code, incorrect or otherwise.

I’m more concerned about ChatGPT’s difficulty in providing advice that’s truly insightful, the kind of advice that you really want from a mentor. It is able to provide advice when you ask it about specific problems–but that’s not enough. A mentor needs to help a student explore problems; a student who is already aware of the problem is well on their way towards solving it, and may not need the mentor at all.

ChatGPT and other language models will inevitably improve, and their ability to act as a mentor will be important to people who are building new kinds of learning experiences. But they haven’t arrived yet. For the time being, if you want a mentor, you’re on your own.

Categories: Technology

Radar Trends to Watch: October 2023

O'Reilly Radar - Tue, 2023/10/03 - 03:01

AI continues to spread. This month, the AI category is limited to developments about AI itself; tools for AI programming are covered in the Programming section.

One of the biggest issues for AI these days is legal. Getty Images is protecting customers who use their generative AI from copyright lawsuits; Microsoft is doing the same for users of their Copilot products.

Also on the legal front: Hashicorp’s switch to a non-open source license has led the OpenTF foundation to build OpenTofu, a fork of Hashicorp’s Terraform product. While it’s too early to say, OpenTofu has quickly gotten some significant adopters.

  • OpenAI has announced that ChatGPT will support voice chats. Will its voice persona be as verbose and obsequious as its text persona?
  • Getty Images has announced a generative image creation model that has been trained exclusively on images for which Getty owns the copyright. Getty will reimburse customers’ legal costs if they are sued for copyright infringement. Getty is compensating artists for the use of their work.
  • Sony and Meta have developed new ways to measure racial bias in computer vision. Sony has developed a two dimensional model for skin tone that accounts for hue in addition to darkness. Meta has released an open source dataset named FACET for testing AI models.
  • The Toyota Research Institute has built robots with large behavior models that use techniques from large language models. These robots have proved much more versatile and easier to train than previous robots.
  • OpenAI has released DALL-E 3, a new image synthesis AI that’s built on top of ChatGPT. It is far better at understanding simple prompts without complex prompt design. It will become a feature of ChatGPT+, and has been integrated into Microsoft’s Bing.
  • In an effort to throttle a flood of AI-generated books, Amazon has limited authors to three books per day. That still seems like a lot—it’s unlikely that a human author could produce one book per day, let alone three.
  • Updates to Google’s Bard include integration with Maps, Google Docs, and a “Check your answer” button. Checking seems to be limited to verifying facts using search results (for which Bard gives citations), but it’s still useful.
  • Optimization by Prompting is a new technique for developing effective prompts. OPRO uses an AI model to optimize the prompts used to solve a problem. Starting with “Take a deep breath” evidently helps.
  • Google’s DeepMind has developed an AI model that can identify variants in genes that could potentially cause disease.
  • Competition in the vector database space is heating up. LanceDB is yet another entry. It is open source, and is designed to be embedded within apps, with no external server to manage. Data is stored on local hard disks, making it conceptually similar to SQLite.
  • Stability AI has released a new demo of generative AI for music, called (unsurprisingly) Stable Audio. Generative AI approaches to music lag behind generative art or text, but Stable Audio has clearly made some progress.
  • Microsoft has announced that it will assume liability for copyright infringement by all of its Copilot products (not just GitHub). They claim to have built guardrails and filters into their products to prevent infringement.
  • HuggingFace now offers Training Cluster as a Service. This service allows you to use their infrastructure to train large language models at scale. The home page lets you build a cost estimate, based on the model size, the training data size, and the number and type of GPUs.
  • Pixel tracking means something different now. MetaAI has announced CoTracker, a Transformer-based tool that tracks the movement of multiple points through a video. Source code is available on GitHub under a Creative Commons license.
  • Google has released Duet AI, its AI-driven extensions to its Workspace suite (Gmail, Docs, etc.). Although there is a free trial, there will be an additional fee for using Duet. It can take notes on meetings in Google Meet, write emails and reports, participate in chats, and more.
  • Google’s DeepMind has launched SynthID, a watermarking tool for AI images. It includes tools for watermarking and detecting the presence of watermarks. SynthID is still experimental, and only available to users of Google’s Imagen, which itself is only available within Vertex AI.
  • The free, open source Godot game engine is proving to be an alternative to Unity. While Unity has (mostly) backed off from its plans to require per-install fees, it has lost trust with much of its development community.
  • OpenTofu, OpenTF’s fork of Hashicorp’s Terraform, has been backed by the Linux Foundation and adopted by several major enterprises.
  • DSPy is an alternative to Langchain and Llamaindex for programming applications with large language models. It stresses programming, rather than prompting. It minimizes the need for labeling and “prompt engineering,” and claims the ability to optimize training and prompting.
  • Zep is yet another framework for building applications with large language models and putting them into production. It incorporates Llamaindex and Langchain.
  • Tools that analyze source code and trace its origins in open source projects are appearing. The development and use of these tools is driven by automated code generators that can infringe upon open source licenses.
  • The WebAssembly Go Playground is a Go compiler and runtime environment that runs completely in the browser.
  • Wasmer is a sandbox for running WebAssembly apps. It allows you to run Wasm applications on the command line or in the cloud with extremely lightweight packaging.
  • Guidance is a programming language for controlling large language models.
  • Microsoft and Anaconda have launched Python in Excel, which allows Excel users to embed Python within spreadsheets.
  • Rivet is a graphical IDE for developing applications for large language models. With minimal coding, users can build prompt flows, using tools like vector databases. It’s part of a growing ecosystem of low-code tools for AI development.
  • JetBrains has released RustRover, a new IDE for Rust. RustRover does not incorporate AI, although it does have the ability to suggest bug fixes. It supports collaboration, and integrates GitHub, the Rust toolchain (of course), and unit testing tools.
  • Refact is a new language model that is designed to support refactoring; it includes fill-in-the-middle support. It is relatively small (1.6B parameters), and has performance equivalent to other publicly testable language models.
  • HuggingFace has developed a new machine learning framework for Rust called Candle. Candle includes GPU support. The GitHub repo links to a number of examples.
  • Google, Apple, and Mozilla have reported a severe vulnerability in the WebP image compression library that is actively being exploited. Fixes are in the current stable release of Chrome and other browsers, but other applications that rely on WebP are vulnerable.
  • The NSA, FBI, and Cybersecurity and Infrastructure Security Agency have published a CyberSecurity Information Sheet about Deepfakes that includes advice on detecting deepfakes and defending against them.
  • Google is releasing an API for its Outline VPN so that developers can build the VPN into their products. Outline has been useful for evading government censorship. The API and SDK will make it easier to build workarounds when governments learn how to detect the use of Outline.
  • Any sufficiently advanced uninstaller is indistinguishable from malware. You have to read it just for the title. A nice piece of analysis.
  • Security breaches frequently occur when an employee leaves a company, but retains access to internal apps or services. Just in time access minimizes the risk by granting access to services only as needed, and for a limited time.
  • Few security stories have happy endings. Here’s one that does: the FBI managed to infiltrate the Qakbot botnet, redirect traffic to its own servers, and use Qakbot to automatically uninstall its own software.
  • How do you maintain security for software that’s updated from a repository? Proper key management (including keeping keys offline) and expiring old metadata are important.
  • MalDoc is a new attack in which a Word document with malicious VB macros is embedded in a PDF document. The document is treated as a PDF by malware scanners, but can be opened either as a Word document (which executes the macros) or as a PDF.
  • Research by Mozilla has shown that connected cars are terrible for privacy. They collect personal data, including video, and send it back to the manufacturer, who can sell it, give it to law enforcement, or use it in other ways without consent. Management of the data doesn’t meet minimum security standards.
  • The Signal Protocol, a protocol for end-to-end encryption, has been upgraded for post-quantum cryptography. The Signal protocol is used by the Signal app, Google’s RCS messaging, and WhatsApp.
  • Two new decentralized projects provide services that previously were only available through centralized servers: Quiet, a team chat app that’s an alternative to Slack and Discord; and Postmarks, a social bookmarking service that’s a successor to the defunct
  • Wavacity is the Audacity audio editor ported to the browser: another tour de force for WASM.
  • Cory Doctorow’s interview about saving the open Web is a must-read. Interoperability is the key.
  • Web LLM now supports LLaMA 2 in the browser! Everything runs in the browser, using WebGPU for GPU acceleration. (Chrome only. Be prepared for a long download when you try the demo.)
  • Humanity’s oldest writing is preserved on ceramics. That may be the future of data storage, too: a startup has developed ceramic-coated tape with storage of up to 1 Petabyte per tape. A data center could easily house a Yottabyte’s worth of tapes.
  • Qualcomm is making a big investment in RISC-V. RISC-V is an open source instruction set architecture. We’ve said several times that RISC-V is on the verge of competing with ARM and Intel; adoption by a vendor like Qualcomm is an important step on that path.

Quantum Computing

  • Researchers used a quantum computer to slow down a chemical process by a factor of 100 billion, allowing them to observe it. This experiment demonstrates the use of a quantum computer as a research tool, aside from its ability to compute.
  • IBM has announced a significant breakthrough in quantum error correction. While QEC remains a difficult and unsolved problem, their work reduces the number of physical qubits needed to construct a virtual error-corrected qubit by a factor of 10.
  • DIY tools that automate insulin delivery systems for managing diabetes are becoming accepted more widely, and can significantly outperform commercial systems. One DIY system has received FDA clearance.
Categories: Technology

Structural Evolutions in Data

O'Reilly Radar - Tue, 2023/09/19 - 04:55

I am wired to constantly ask “what’s next?” Sometimes, the answer is: “more of the same.”

That came to mind when a friend raised a point about emerging technology’s fractal nature. Across one story arc, they said, we often see several structural evolutions—smaller-scale versions of that wider phenomenon.

Cloud computing? It progressed from “raw compute and storage” to “reimplementing key services in push-button fashion” to “becoming the backbone of AI work”—all under the umbrella of “renting time and storage on someone else’s computers.” Web3 has similarly progressed through “basic blockchain and cryptocurrency tokens” to “decentralized finance” to “NFTs as loyalty cards.” Each step has been a twist on “what if we could write code to interact with a tamper-resistant ledger in real-time?”

Most recently, I’ve been thinking about this in terms of the space we currently call “AI.” I’ve called out the data field’s rebranding efforts before; but even then, I acknowledged that these weren’t just new coats of paint. Each time, the underlying implementation changed a bit while still staying true to the larger phenomenon of “Analyzing Data for Fun and Profit.”

Consider the structural evolutions of that theme:

Stage 1: Hadoop and Big Data

By 2008, many companies found themselves at the intersection of “a steep increase in online activity” and “a sharp decline in costs for storage and computing.” They weren’t quite sure what this “data” substance was, but they’d convinced themselves that they had tons of it that they could monetize. All they needed was a tool that could handle the massive workload. And Hadoop rolled in.

In short order, it was tough to get a data job if you didn’t have some Hadoop behind your name. And harder to sell a data-related product unless it spoke to Hadoop. The elephant was unstoppable.

Until it wasn’t. 

Hadoop’s value—being able to crunch large datasets—often paled in comparison to its costs. A basic, production-ready cluster priced out in the low six figures. A company then needed to train up their ops team to manage the cluster, and their analysts to express their ideas in MapReduce. Plus there was all of the infrastructure to push data into the cluster in the first place.

If you weren’t in the terabytes-a-day club, you really had to take a step back and ask what this was all for. Doubly so as hardware improved, eating away at the lower end of Hadoop-worthy work.

And then there was the other problem: for all the fanfare, Hadoop was really large-scale business intelligence (BI).

(Enough time has passed; I think we can now be honest with ourselves. We built an entire industry by … repackaging an existing industry. This is the power of marketing.)

Don’t get me wrong. BI is useful. I’ve sung its praises time and again. But the grouping and summarizing just wasn’t exciting enough for the data addicts. They’d grown tired of learning what is; now they wanted to know what’s next.

Stage 2: Machine learning models

Hadoop could kind of do ML, thanks to third-party tools. But in its early form, Mahout, a Hadoop-based ML library, still required data scientists to write in Java. And it (wisely) stuck to implementations of industry-standard algorithms. If you wanted ML beyond what Mahout provided, you had to frame your problem in MapReduce terms. Mental contortions led to code contortions led to frustration. And, often, to giving up.

(After coauthoring Parallel R I gave a number of talks on using Hadoop. A common audience question was “can Hadoop run [my arbitrary analysis job or home-grown algorithm]?” And my answer was a qualified yes: “Hadoop could theoretically scale your job. But only if you or someone else will take the time to implement that approach in MapReduce.” That didn’t go over well.)

Goodbye, Hadoop. Hello, R and scikit-learn. A typical data job interview now skipped MapReduce in favor of white-boarding k-means clustering or random forests.
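For what it's worth, the whiteboard version of k-means is only a few lines. A minimal 1-D sketch of my own (real interviews, and real work, would reach for scikit-learn's KMeans):

```python
import random

def kmeans_1d(points, k=2, iterations=10, seed=0):
    """Minimal 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

print(kmeans_1d([1.0, 1.1, 0.9, 10.0, 10.2, 9.8]))  # two centers near 1.0 and 10.0
```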

And it was good. For a few years, even. But then we hit another hurdle.

While data scientists were no longer handling Hadoop-sized workloads, they were trying to build predictive models on a different kind of “large” dataset: so-called “unstructured data.” (I prefer to call that “soft numbers,” but that’s another story.) A single document may represent thousands of features. An image? Millions.

Similar to the dawn of Hadoop, we were back to problems that existing tools could not solve.

The solution led us to the next structural evolution. And that brings our story to the present day:

Stage 3: Neural networks

High-end video games required high-end video cards. And since the cards couldn’t tell the difference between “matrix algebra for on-screen display” and “matrix algebra for machine learning,” neural networks became computationally feasible and commercially viable. It felt like, almost overnight, all of machine learning took on some kind of neural backend. Those algorithms packaged with scikit-learn? They were unceremoniously relabeled “classical machine learning.”

There’s as much Keras, TensorFlow, and Torch today as there was Hadoop back in 2010-2012. The data scientist—sorry, “machine learning engineer” or “AI specialist”—job interview now involves one of those toolkits, or one of the higher-level abstractions such as HuggingFace Transformers.

And just as we started to complain that the crypto miners were snapping up all of the affordable GPU cards, cloud providers stepped up to offer access on-demand. Between Google (Vertex AI and Colab) and Amazon (SageMaker), you can now get all of the GPU power your credit card can handle. Google goes a step further in offering compute instances with its specialized TPU hardware.

Not that you’ll even need GPU access all that often. A number of groups, from small research teams to tech behemoths, have used their own GPUs to train on large, interesting datasets and they give those models away for free on sites like TensorFlow Hub and Hugging Face Hub. You can download these models to use out of the box, or employ minimal compute resources to fine-tune them for your particular task.

You see the extreme version of this pretrained model phenomenon in the large generative models that drive tools like Midjourney or ChatGPT. The overall idea of generative AI is to get a model to create content that could have reasonably fit into its training data. For a sufficiently large training dataset—say, “billions of online images” or “the entirety of Wikipedia”—a model can pick up on the kinds of patterns that make its outputs seem eerily lifelike.

Since we’re covered as far as compute power, tools, and even prebuilt models, what are the frictions of GPU-enabled ML? What will drive us to the next structural iteration of Analyzing Data for Fun and Profit?

Stage 4? Simulation

Given the progression thus far, I think the next structural evolution of Analyzing Data for Fun and Profit will involve a new appreciation for randomness. Specifically, through simulation.

You can see a simulation as a temporary, synthetic environment in which to test an idea. We do this all the time, when we ask “what if?” and play it out in our minds. “What if we leave an hour earlier?” (We’ll miss rush hour traffic.) “What if I bring my duffel bag instead of the roll-aboard?” (It will be easier to fit in the overhead storage.) That works just fine when there are only a few possible outcomes, across a small set of parameters.
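The mental version stops scaling quickly, but once the intuition is written down as numbers, a computer can replay it as many times as we like. A minimal sketch of the leave-early question, where every figure is invented for illustration:

```python
import random
import statistics

def commute_minutes(rng, leave_early):
    """Toy travel-time model (all numbers invented): a baseline drive
    plus a rush-hour penalty that leaving an hour early avoids."""
    base = rng.gauss(30, 5)  # free-flow driving time
    rush = 0.0 if leave_early else max(rng.gauss(25, 10), 0)
    return base + rush

rng = random.Random(0)
early = [commute_minutes(rng, True) for _ in range(100_000)]
usual = [commute_minutes(rng, False) for _ in range(100_000)]
print(f"leave an hour early: {statistics.mean(early):.1f} min on average")
print(f"leave at the usual time: {statistics.mean(usual):.1f} min on average")
```

Two what-if branches, a hundred thousand trials each, and the comparison comes back with a distribution rather than a hunch.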

Once we’re able to quantify a situation, we can let a computer run “what if?” scenarios at industrial scale. Millions of tests, across as many parameters as will fit on the hardware. It’ll even summarize the results if we ask nicely. That opens the door to a number of possibilities, three of which I’ll highlight here:

Moving beyond point estimates

Let’s say an ML model tells us that this house should sell for $744,568.92. Great! We’ve gotten a machine to make a prediction for us. What more could we possibly want?

Context, for one. The model’s output is just a single number, a point estimate of the most likely price. What we really want is the spread—the range of likely values for that price. Does the model think the correct price falls between $743k-$746k? Or is it more like $600k-$900k? You want the former case if you’re trying to buy or sell that property.

Bayesian data analysis, and other techniques that rely on simulation behind the scenes, offer additional insight here. These approaches vary some parameters, run the process a few million times, and give us a nice curve that shows how often the answer is (or is not) close to that $744k.

Similarly, Monte Carlo simulations can help us spot trends and outliers in potential outcomes of a process. “Here’s our risk model. Let’s assume these ten parameters can vary, then try the model with several million variations on those parameter sets. What can we learn about the potential outcomes?” Such a simulation could reveal that, under certain specific circumstances, we get a case of total ruin. Isn’t it nice to uncover that in a simulated environment, where we can map out our risk mitigation strategies with calm, level heads?
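A hand-rolled version of that risk-model loop fits in a few lines. This is a sketch with invented parameters, not a real risk model:

```python
import random

def simulate_portfolio(rng, years=10, crash_prob=0.02):
    """Toy risk model (parameters invented): noisy yearly returns,
    plus a rare crash year that halves the portfolio."""
    value = 100.0
    for _ in range(years):
        if rng.random() < crash_prob:
            value *= 0.5                     # the rare shock
        value *= 1 + rng.gauss(0.06, 0.15)   # an ordinary year's return
    return value

rng = random.Random(1)
outcomes = [simulate_portfolio(rng) for _ in range(100_000)]
ruin = sum(v < 20 for v in outcomes) / len(outcomes)
print(f"runs ending below 20% of starting value: {ruin:.2%}")
```

The average outcome looks healthy; the point of the exercise is the small but nonzero fraction of runs that end in near-ruin, which a single point estimate would never surface.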

Moving beyond point estimates sits close enough to the problems today's AI practitioners already face that it's a likely next step in Analyzing Data for Fun and Profit. In turn, that could open the door to other techniques:

New ways of exploring the solution space

If you’re not familiar with evolutionary algorithms, they’re a twist on the traditional Monte Carlo approach. In fact, they’re like several small Monte Carlo simulations run in sequence. After each iteration, the process compares the results to its fitness function, then mixes the attributes of the top performers. Hence the term “evolutionary”—combining the winners is akin to parents passing a mix of their attributes on to progeny. Repeat this enough times and you may just find the best set of parameters for your problem.

(People familiar with optimization algorithms will recognize this as a twist on simulated annealing: start with random parameters and attributes, and narrow that scope over time.)
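The whole loop—score, select, recombine, mutate—fits in a short sketch. The fitness function and every parameter below are invented for illustration:

```python
import random

def evolve(fitness, pop_size=50, genes=10, generations=80, seed=0):
    """Minimal evolutionary loop: score the population, keep the top
    performers, recombine their attributes, and mutate slightly."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        winners = pop[: pop_size // 5]        # the fittest survive unchanged
        children = []
        while len(children) < pop_size - len(winners):
            a, b = rng.sample(winners, 2)     # two "parents"
            cut = rng.randrange(1, genes)     # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(genes)] += rng.gauss(0, 0.1)  # mutation
            children.append(child)
        pop = winners + children
    return max(pop, key=fitness)

# Invented fitness function: reward genes for being close to 0.5.
best = evolve(lambda g: -sum((x - 0.5) ** 2 for x in g))
print(max(abs(x - 0.5) for x in best))  # the worst gene is still near 0.5
```

Swap in a timetable-scoring or molecule-scoring function for the toy fitness and the same shuffle-and-recombine loop applies; the hard part in practice is designing that fitness function.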

A number of scholars have tested this shuffle-and-recombine-till-we-find-a-winner approach on timetable scheduling. Their research has applied evolutionary algorithms to groups that need efficient ways to manage finite, time-based resources such as classrooms and factory equipment. Other groups have tested evolutionary algorithms in drug discovery. Both situations benefit from a technique that optimizes the search through a large and daunting solution space.

The NASA ST5 antenna is another example. Its bent, twisted wire stands in stark contrast to the straight aerials with which we are familiar. There’s no chance that a human would ever have come up with it. But the evolutionary approach could, in part because it was not limited by a human sense of aesthetics or any preconceived notions of what an “antenna” could be. It just kept shuffling the designs that satisfied its fitness function until the process finally converged.

Taming complexity

Complex adaptive systems are hardly a new concept, though most people got a harsh introduction at the start of the Covid-19 pandemic. Cities closed down, supply chains snarled, and people—independent actors, behaving in their own best interests—made it worse by hoarding supplies because they thought distribution and manufacturing would never recover. Today, reports of idle cargo ships and overloaded seaside ports remind us that we shifted from under- to over-supply. The mess is far from over.

What makes a complex system troublesome isn’t the sheer number of connections. It’s not even that many of those connections are invisible because a person can’t see the entire system at once. The problem is that those hidden connections only become visible during a malfunction: a failure in Component B affects not only neighboring Components A and C, but also triggers disruptions in T and R. R’s issue is small on its own, but it has just led to an outsized impact in Φ and Σ.

(And if you just asked “wait, how did Greek letters get mixed up in this?” then …  you get the point.)

Our current crop of AI tools is powerful, yet ill-equipped to provide insight into complex systems. We can’t surface these hidden connections using a collection of independently-derived point estimates; we need something that can simulate the entangled system of independent actors moving all at once.

This is where agent-based modeling (ABM) comes into play. This technique simulates interactions in a complex system. Similar to the way a Monte Carlo simulation can surface outliers, an ABM can catch unexpected or unfavorable interactions in a safe, synthetic environment.

Financial markets and other economic situations are prime candidates for ABM. These are spaces where a large number of actors behave according to their rational self-interest, and their actions feed into the system and affect others’ behavior. According to practitioners of complexity economics (a study that owes its origins to the Santa Fe Institute), traditional economic modeling treats these systems as though they run in an equilibrium state and therefore fails to identify certain kinds of disruptions. ABM captures a more realistic picture because it simulates a system that feeds back into itself.
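A toy agent-based market shows the feedback loop in miniature. Everything here, from the strategies to the price rule, is invented for illustration:

```python
import random

def simulate_market(steps=200, n_agents=100, seed=0):
    """Toy agent-based market (everything invented): each agent either
    follows the trend (+1) or trades against it (-1), and their combined
    orders move the very price they all react to."""
    rng = random.Random(seed)
    agents = [rng.choice([1, -1]) for _ in range(n_agents)]
    price, momentum, history = 100.0, 0.0, []
    for _ in range(steps):
        # Trend-followers buy when the price is rising; contrarians sell.
        orders = sum(a if momentum >= 0 else -a for a in agents)
        price += 0.005 * orders + rng.gauss(0, 0.5)  # orders plus outside noise
        momentum = price - (history[-1] if history else price)
        history.append(price)
    return history

prices = simulate_market()
print(f"final price after 200 steps: {prices[-1]:.2f}")
```

Even this cartoon exhibits the ABM signature: no agent intends a boom or a crash, but runs of self-reinforcing orders appear anyway, visible only because the whole entangled system moves at once.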

Smoothing the on-ramp

Interestingly enough, I haven’t mentioned anything new or ground-breaking. Bayesian data analysis and Monte Carlo simulations are common in finance and insurance. I was first introduced to evolutionary algorithms and agent-based modeling more than fifteen years ago. (If memory serves, this was shortly before I shifted my career to what we now call AI.) And even then I was late to the party.

So why hasn’t this next phase of Analyzing Data for Fun and Profit taken off?

For one, this structural evolution needs a name. Something to distinguish it from “AI.” Something to market. I’ve been using the term “synthetics,” so I’ll offer that up. (Bonus: this umbrella term neatly includes generative AI’s ability to create text, images, and other realistic-yet-heretofore-unseen data points. So we can ride that wave of publicity.)

Next up is compute power. Simulations are CPU-heavy, and sometimes memory-bound. Cloud computing providers make that easier to handle, though, so long as you don’t mind the credit card bill. Eventually we’ll get simulation-specific hardware—what will be the GPU or TPU of simulation?—but I think synthetics can gain traction on existing gear.

The third and largest hurdle is the lack of simulation-specific frameworks. As we surface more use cases—as we apply these techniques to real business problems or even academic challenges—we’ll improve the tools because we’ll want to make that work easier. As the tools improve, that reduces the costs of trying the techniques on other use cases. This kicks off another iteration of the value loop. Use cases tend to magically appear as techniques get easier to use.

If you think I’m overstating the power of tools to spread an idea, imagine trying to solve a problem with a new toolset while also creating that toolset at the same time. It’s tough to balance those competing concerns. If someone else offers to build the tool while you use it and road-test it, you’re probably going to accept. This is why these days we use TensorFlow or Torch instead of hand-writing our backpropagation loops.

Today’s landscape of simulation tooling is uneven. People doing Bayesian data analysis have their choice of two robust, authoritative offerings in Stan and PyMC3, plus a variety of books to understand the mechanics of the process. Things fall off after that. Most of the Monte Carlo simulations I’ve seen are of the hand-rolled variety. And a quick survey of agent-based modeling and evolutionary algorithms turns up a mix of proprietary apps and nascent open-source projects, some of which are geared for a particular problem domain.

As we develop the authoritative toolkits for simulations—the TensorFlow of agent-based modeling and the Hadoop of evolutionary algorithms, if you will—expect adoption to grow. Doubly so, as commercial entities build services around those toolkits and rev up their own marketing (and publishing, and certification) machines.

Time will tell

My expectations of what’s to come are, admittedly, shaped by my experience and clouded by my interests. Time will tell whether any of this hits the mark.

A change in business or consumer appetite could also send the field down a different road. The next hot device, app, or service will get an outsized vote in what companies and consumers expect of technology.

Still, I see value in looking for this field’s structural evolutions. The wider story arc changes with each iteration to address changes in appetite. Practitioners and entrepreneurs, take note.

Job-seekers should do the same. Remember that you once needed Hadoop on your résumé to merit a second look; nowadays it’s a liability. Building models is a desired skill for now, but it’s slowly giving way to robots. So do you really think it’s too late to join the data field? I think not.

Keep an eye out for that next wave. That’ll be your time to jump in.

Categories: Technology

The Real Problem with Software Development

O'Reilly Radar - Tue, 2023/09/12 - 04:01

A few weeks ago, I saw a tweet that said “Writing code isn’t the problem. Controlling complexity is.” I wish I could remember who said that; I will be quoting it a lot in the future. That statement nicely summarizes what makes software development difficult. It’s not just memorizing the syntactic details of some programming language, or the many functions in some API, but understanding and managing the complexity of the problem you’re trying to solve.

We’ve all seen this many times. Lots of applications and tools start simple. They do 80% of the job well, maybe 90%. But that isn’t quite enough. Version 1.1 gets a few more features, more creep in with version 1.2, and by the time you get to 3.0, an elegant user interface has turned into a mess. This increase in complexity is one reason that applications tend to become less usable over time. We also see this phenomenon as one application replaces another. RCS was useful, but didn’t do everything we needed it to; SVN was better; Git does just about everything you could want, but at an enormous cost in complexity. (Could Git’s complexity be managed better? I’m not the one to say.) OS X, which used to trumpet “It just works,” has evolved to “it used to just work”; the most user-centric Unix-like system ever built now staggers under the load of new and poorly thought-out features.

The problem of complexity isn’t limited to user interfaces; that may be the least important (though most visible) aspect of the problem. Anyone who works in programming has seen the source code for some project evolve from something short, sweet, and clean to a seething mass of bits. (These days, it’s often a seething mass of distributed bits.) Some of that evolution is driven by an increasingly complex world that requires attention to secure programming, cloud deployment, and other issues that didn’t exist a few decades ago. But even here: a requirement like security tends to make code more complex—but complexity itself hides security issues. Saying “yes, adding security made the code more complex” is wrong on several fronts. Security that’s added as an afterthought almost always fails. Designing security in from the start almost always leads to a simpler result than bolting security on as an afterthought, and the complexity will stay manageable if new features and security grow together. If we’re serious about complexity, the complexity of building secure systems needs to be managed and controlled in step with the rest of the software, otherwise it’s going to add more vulnerabilities.

That brings me to my main point. We’re seeing more code that’s written (at least in first draft) by generative AI tools, such as GitHub Copilot, ChatGPT (especially with Code Interpreter), and Google Codey. One advantage of computers, of course, is that they don’t care about complexity. But that advantage is also a significant disadvantage. Until AI systems can generate code as reliably as our current generation of compilers, humans will need to understand—and debug—the code they write. Brian Kernighan wrote that “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” We don’t want a future that consists of code too clever to be debugged by humans—at least not until the AIs are ready to do that debugging for us. Really brilliant programmers write code that finds a way out of the complexity: code that may be a little longer, a little clearer, a little less clever so that someone can understand it later. (Copilot running in VSCode has a button that simplifies code, but its capabilities are limited.)

Furthermore, when we’re considering complexity, we’re not just talking about individual lines of code and individual functions or methods. Most professional programmers work on large systems that can consist of thousands of functions and millions of lines of code. That code may take the form of dozens of microservices running as asynchronous processes and communicating over a network. What is the overall structure, the overall architecture, of these programs? How are they kept simple and manageable? How do you think about complexity when writing or maintaining software that may outlive its developers? Millions of lines of legacy code going back as far as the 1960s and 1970s are still in use, much of it written in languages that are no longer popular. How do we control complexity when working with these?

Humans don’t manage this kind of complexity well, but that doesn’t mean we can check out and forget about it. Over the years, we’ve gradually gotten better at managing complexity. Software architecture is a distinct specialty that has only become more important over time. It’s growing more important as systems grow larger and more complex, as we rely on them to automate more tasks, and as those systems need to scale to dimensions that were almost unimaginable a few decades ago. Reducing the complexity of modern software systems is a problem that humans can solve—and I haven’t yet seen evidence that generative AI can. Strictly speaking, that’s not a question that can even be asked yet. Claude 2 has a maximum context—the upper limit on the amount of text it can consider at one time—of 100,000 tokens [1]; at this time, all other large language models are significantly smaller. While 100,000 tokens is huge, it’s much smaller than the source code for even a moderately sized piece of enterprise software. And while you don’t have to understand every line of code to do a high-level design for a software system, you do have to manage a lot of information: specifications, user stories, protocols, constraints, legacies, and much more. Is a language model up to that?

Could we even describe the goal of “managing complexity” in a prompt? A few years ago, many developers thought that minimizing “lines of code” was the key to simplification—and it would be easy to tell ChatGPT to solve a problem in as few lines of code as possible. But that’s not really how the world works, not now, and not back in 2007. Minimizing lines of code sometimes leads to simplicity, but just as often leads to complex incantations that pack multiple ideas onto the same line, often relying on undocumented side effects. That’s not how to manage complexity. Mantras like DRY (Don’t Repeat Yourself) are often useful (as is most of the advice in The Pragmatic Programmer), but I’ve made the mistake of writing code that was overly complex to eliminate one of two very similar functions. Less repetition, but the result was more complex and harder to understand. Lines of code are easy to count, but if that’s your only metric, you will lose track of qualities like readability that may be more important. Any engineer knows that design is all about tradeoffs—in this case, trading off repetition against complexity—but difficult as these tradeoffs may be for humans, it isn’t clear to me that generative AI can make them any better, if at all.
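To make that tradeoff concrete, here is a hypothetical example (not from any real codebase): two ways to find the three most frequent words in a string. The one-liner minimizes lines of code; the longer version minimizes the reader's effort.

```python
from collections import Counter

# "As few lines as possible": several ideas packed onto one line, plus a
# hidden quadratic scan from calling .count() inside the comprehension.
top3 = lambda t: sorted({w: t.lower().split().count(w) for w in set(t.lower().split())}.items(), key=lambda kv: -kv[1])[:3]

# More lines, but each step is named, visible, and linear-time.
def top_three_words(text):
    """Return the three most frequent words in the text."""
    words = text.lower().split()
    return Counter(words).most_common(3)

sample = "the cat sat on the mat the end"
print(top_three_words(sample))  # first entry: ('the', 3)
```

Both produce the same answer; by the lines-of-code metric the lambda wins, and by every other measure of complexity it loses.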

I’m not arguing that generative AI doesn’t have a role in software development. It certainly does. Tools that can write code are certainly useful: they save us looking up the details of library functions in reference manuals, they save us from remembering the syntactic details of the less commonly used abstractions in our favorite programming languages. As long as we don’t let our own mental muscles decay, we’ll be ahead. I am arguing that we can’t get so tied up in automatic code generation that we forget about controlling complexity. Large language models don’t help with that now, though they might in the future. If they free us to spend more time understanding and solving the higher-level problems of complexity, though, that will be a significant gain.

Will the day come when a large language model will be able to write a million line enterprise program? Probably. But someone will have to write the prompt telling it what to do. And that person will be faced with the problem that has characterized programming from the start: understanding complexity, knowing where it’s unavoidable, and controlling it.

  [1] It’s common to say that a token is approximately ⅘ of a word. It’s not clear how that applies to source code, though. It’s also common to say that 100,000 words is the size of a novel, but that’s only true for rather short novels.
Categories: Technology

Radar Trends to Watch: September 2023

O'Reilly Radar - Tue, 2023/09/05 - 03:14

While the AI group is still the largest, it’s notable that Programming, Web, and Security are all larger than they’ve been in recent months. One reason is certainly that we’re pushing AI news into other categories as appropriate. But I also think that it’s harder to impress with AI than it used to be. AI discussions have been much more about regulation and intellectual property—which makes me wonder whether legislation should be a separate category.

That notwithstanding, it’s important that OpenAI is now allowing API users to fine-tune their GPT-3.5 apps. It’s as-a-service, of course. And RISC-V finally appears to be getting some serious adoption. Could it compete with ARM and Intel? We shall see.

  • OpenAI has announced ChatGPT Enterprise, a version of ChatGPT that targets enterprise customers. ChatGPT Enterprise offers improved security, a promise that they won’t train on your conversations, single sign-on, an admin console, a larger 32K context, higher performance, and the elimination of usage caps.
  • Facebook/Meta has released Code LLaMA, a version of their LLaMA 2 model that has been specialized for writing code. It can be used for code generation or completion. Its context window is 100,000 tokens, allowing Code LLaMA to be more accurate on larger programs.
  • OpenAI has announced that API users can now fine-tune GPT-3.5 for their own applications. Fine-tuning for GPT-4 will come later. To preserve safety, tuning data is passed through OpenAI’s moderation filter.
  • txtai is an open source embeddings database. It is a vector database that has been designed specifically to work with natural language problems.
  • TextFX is a set of tools that use Google’s PaLM 2 model to play with language. It doesn’t answer questions or write poems; it allows users to see the possibilities in words as an aid to their own creativity.
  • A US judge has ruled that an AI system cannot copyright a work. In this case, the AI itself—not the human user—was to hold the copyright. This ruling is in line with the Copyright Office’s guidance: giving prompts to a generative algorithm isn’t sufficient to create a copyrightable work.
  • Despite an error rate of roughly 50% for ChatGPT, a study shows that users prefer ChatGPT’s answers to programming questions over answers from StackOverflow. ChatGPT’s complete, articulate, and polite answers appear to be the cause of this preference.
  • AI was on the agenda at DefCon and, while results of a red teaming competition won’t be released for some months, it’s clear that security remains an afterthought, and that attacking the current AI models is extremely easy.
  • Emotion recognition is difficult, if not impossible. It is not clear that there are any credible use cases for it. AI systems are particularly bad at it. But companies are building products.
  • Watermarking has been proposed as a technique for identifying whether content was generated by AI, but it’s not a panacea. Here are some questions to help evaluate whether watermarks are useful in any given situation.
  • Zoom and Grammarly have both issued new license agreements that allow them to use data collected from users to train AI. Zoom has backed down after customer backlash, but that raises the question: Will other applications follow?
  • Using large language models for work or play is one thing, but how do you put one into production? 7 Frameworks for Serving LLMs surveys some tools for deploying language models.
  • Simon Willison provides instructions for running LLaMA 2 on a Mac. He also provides slides and a well-edited transcript of his talk about LLMs at North Bay Python.
  • PhotoGuard is a tool for protecting photos and other images from manipulation by AI systems. It adds data to the image in ways that aren’t detectable by humans, but that introduce noticeable distortions when the image is modified.
  • C2PA is a cryptographic protocol for attesting to the provenance of electronic documents. It could be used for specifying whether documents are generated by AI.
  • Google’s DeepMind has built a vision-language-action model called RT-2 (Robotic Transformer 2) that combines vision and language with the ability to control a robot. It learns both from web data (images and text) and robotic data (interactions with physical objects).
  • Maccarone is an extension to VSCode that allows you to “delegate” blocks of Python code to AI (GPT-4). The portions of the code that are under AI control are automatically updated as needed when the surrounding code is changed.
  • Microsoft is adding Python as a scripting language for Excel formulas. Python code executes in an Azure container that includes some commonly used libraries, including Matplotlib and Pandas.
  • Many companies are building platform engineering teams as a means of making software developers more effective. Here are some ideas about getting started with platform engineering.
  • A Google study of its in-house Rust use supports the claim that Rust makes it easier to produce high-quality code. The study also busts a number of myths about the language. It isn’t as hard to learn as most people think (then again, this is a Google study).
  • deno_python is a Javascript module that allows integration between Javascript (running on Deno) and Python, allowing Javascript programmers to call important Python libraries and call Python functions.
  • The Python Steering Council has announced that it will make the Global Interpreter Lock (GIL) optional in a future version of Python. Python’s GIL has long been a barrier to effective multi-threaded computing. The change will be backwards-compatible.
  • Google’s controversial Web Environment Integrity proposal provides a way for web servers to cryptographically authenticate the browser software making a request. WEI could potentially reduce online fraud, but it also presents some significant privacy risks.
  • Trafilatura is a new tool for web scraping that has been designed with quantitative research in mind (for example, assembling training data for language models). It can extract text and metadata from HTML, and generate output in a number of formats.
  • Astro is yet another open source web framework that’s designed for high performance and ease of development.
  • While the “browser wars” are far behind us, it is still difficult for developers to write code that works correctly on all browsers. Baseline is a project of the W3C’s WebDX Community Group that specifies which features web developers can rely on in the most widely used browsers.
  • How Large Language Models Assisted a Website Makeover raises some important questions: When do you stop using ChatGPT and finish the job yourself?  When does your own ability start to atrophy?
  • Remember Flash? It has a museum… And Flash games will run in a modern browser using Ruffle, a Flash Player emulator that is written in WebAssembly.
  • Proof-of-work makes it to the Tor network. It is used as a defense against denial of service attacks. PoW is disabled most of the time, but when traffic seems unusually high, it can switch on, forcing users to “prove” their humanness (actually, their willingness to perform work).
  • A retrospective on this year’s MOVEit attack draws some important conclusions about protecting your assets. Mapping the supply chain, third party risk management, zero trust, and continuous penetration testing are all important parts of a security plan.
  • Bitwarden has released an open source end-to-end encrypted secrets manager. The secrets manager allows safe distribution of API keys, certificates and other sensitive data.
  • The US Government has announced the AI Cybersecurity Challenge (AIxCC). AIxCC is a two year competition to build AI systems that can secure critical software. There’s $18.5 million in prizes, plus the possibility of DARPA funding for up to seven companies.
  • OSC&R is the Open Source Supply Chain Attack Reference, a new project that catalogs and describes techniques used to attack software supply chains. It is modeled on MITRE’s ATT&CK framework.
  • The Lapsus$ group has become one of the most effective threat actors, despite being relatively unsophisticated. They rely on persistence, clever social engineering, and analyzing weak points in an organization’s security posture rather than compromising infrastructure.
  • The NSA has issued a report that gives guidance on how to protect systems against memory safety bugs.
  • Bruce Schneier has an important take on the long-term consequences of the SolarWinds attack. Those consequences include the theft of an Azure customer account signing key that in turn has been used by attackers to access US government email accounts.
  • A new generation of ransomware attacks is targeting IT professionals via fake advertisements for IT tools. While IT professionals are (presumably) more wary and aware than other users, they are also high-value targets.
  • Parmesan cheese producers are experimenting with adding microchips to the cheese rind to authenticate genuine cheese.
  • Adoption of RISC-V, a royalty-free open source instruction set architecture for microprocessors, has been increasing. Could it displace ARM?
  • Speculative execution bugs have been discovered for recent Intel (“Downfall”) and AMD (“Inception”) processors. Patches for Linux have been released.
Operations

Quantum Computing
  • Peter Shor, inventor of the quantum algorithm for factoring large integers (which in turn could be used to break most modern cryptography that isn’t quantum-resistant), has published the lecture notes from the course on quantum computing that he teaches at MIT.
  • A Honeywell quantum computer has been used to find a material that can improve solar cell efficiency. It’s likely that the first applications of quantum computing will involve simulating quantum phenomena rather than pure computation.
  • If you’re interested in iris-scanning WorldCoin, a cryptographer analyzes the privacy promises made by their system. He remains skeptical, but came away more impressed than he expected to be.
  • Paypal has introduced a stablecoin that claims to be fully backed by US dollars.
Categories: Technology

The next generation of developer productivity

O'Reilly Radar - Tue, 2023/08/15 - 03:06

To follow up on our previous survey about low-code and no-code tools, we decided to run another short survey about tools specifically for software developers—including, but not limited to, GitHub Copilot and ChatGPT. We’re interested in how “developer enablement” tools of all sorts are changing the workplace. Our survey [1] showed that while these tools increased productivity, they aren’t without their costs. Both upskilling and retraining developers to use these tools are issues.

Few professional software developers will find it surprising that software development teams are under pressure to deliver: 29% of respondents said that productivity is the biggest challenge their organization faced, and another 19% said that time to market and deployment speed are the biggest challenges. Those two answers are almost the same: decreasing time to market requires increasing productivity, and improving deployment speed is itself an increase in productivity. Together, those two answers represented 48% of the respondents, just short of half.

HR issues were the second-most-important challenge, but they’re nowhere near as pressing. 12% of the respondents reported that job satisfaction is the greatest challenge; 11% said that there aren’t good job candidates to hire; and 10% said that employee retention is the biggest issue. Those three challenges total 33%, just one-third of the respondents.

[1] Our survey ran from April 18 to April 25, 2023. There were 739 responses.

It’s heartening to realize that hiring and retention are still challenges in this time of massive layoffs, but it’s also clear that these issues are less important than productivity.

But the big issue, the issue we wanted to explore, isn’t the challenges themselves; it’s what organizations are doing to meet them. A surprisingly large percentage of respondents (28%) aren’t making any changes to become more productive. But 20% are changing their onboarding and upskilling processes, 15% are hiring new developers, and 13% are using self-service engineering platforms.

We found that the biggest struggle for developers working with new tools is training (34%), and another 12% said the biggest struggle is “ease of use.” Together, that’s almost half of all respondents (46%). That was a surprise, since many of these tools are supposed to be low- or no-code. We’re thinking specifically about tools like GitHub Copilot, Amazon CodeWhisperer, and other code generators, but almost all productivity tools claim to make life simpler. At least at first, that’s clearly not true. There’s a learning curve, and it appears to be steeper than we’d have guessed. It’s also worth noting that 13% of the respondents said that the tools “didn’t effectively solve the problems that developers face.”

Over half of the respondents (51%) said that their organizations are using self-service deployment pipelines to increase productivity. Another 13% said that while they’re using self-service pipelines, they haven’t seen an increase in productivity. So almost two-thirds of the respondents are using self-service pipelines for deployment, and for most of them, the pipelines are working—reducing the overhead required to put new projects into production.

Finally, we wanted to know specifically about the effect of GitHub Copilot, ChatGPT, and other AI-based programming tools. Two-thirds of the respondents (67%) reported that these tools aren’t in use at their organizations. We suspect this estimate is lowballing Copilot’s actual usage. Back in the early 2000s, a widely quoted survey reported that CIOs almost unanimously said that their IT organizations weren’t making use of open source. How little they knew! Actual usage of Copilot, ChatGPT, and similar tools is likely to be much higher than 33%. We’re sure that even if they aren’t using Copilot or ChatGPT on the job, many programmers are experimenting with these tools or using them on personal projects.

What about the 33% who reported that Copilot and ChatGPT are in use at their organizations? First, realize that these are early adopters: Copilot was only released a year and a half ago, and ChatGPT has been out for less than a year. It’s certainly significant that they (and similar tools) have grabbed a third of the market in that short a period. It’s also significant that making a commitment to a new way of programming—and these tools are nothing if not a new kind of programming—is a much bigger change than, say, signing up for a ChatGPT account.

11% of the respondents said their organizations use Copilot and ChatGPT, and that the tools are primarily useful to junior developers; 13% said they’re primarily useful to senior developers. Another 9% said that the tools haven’t yielded an increase in productivity. The difference between junior and senior developers is closer than we expected. Common wisdom is that Copilot is more of an advantage to senior programmers, who are better able to describe the problem they need to solve in an intricate set of prompts and to notice bugs in the generated code quickly. Our survey hints that the difference between senior and junior developers is relatively small—although they’re almost certainly using Copilot in different ways. Junior developers are using it to learn and to spend less time solving problems by looking up solutions on Stack Overflow or searching online documentation. Senior developers are using it to help design and structure systems, and even to create production code.

Is developer productivity an issue? Of course; it always is. Part of the solution is improved tooling: self-service deployment, code-generation tools, and other new technologies and ideas. Productivity tools—and specifically the successors to tools like Copilot—are remaking software development in radical ways. Software developers are getting value from these tools, but don’t let the buzz fool you: that value doesn’t come for free. Nobody’s going to sit down with ChatGPT, type “Generate an enterprise application for selling shoes,” and come away with something worthwhile. Each has its own learning curve, and it’s easy to underestimate how steep that curve can be. Developer productivity tools will be a big part of the future; but to take full advantage of those tools, organizations will need to plan for skills development.


The ChatGPT Surge

O'Reilly Radar - Tue, 2023/08/08 - 03:12

I’m sure that nobody will be surprised that the number of searches for ChatGPT on the O’Reilly learning platform skyrocketed after its release in November 2022. It might be a surprise how quickly it got to the top of our charts: it peaked in May as the 6th most common search query. Then it declined almost as quickly, falling back to #8 in June and to #19 in July. At its peak, ChatGPT was in very exclusive company: it’s not quite on the level of Python, Kubernetes, and Java, but it’s in the mix with AWS and React, and significantly ahead of Docker.

A look at the number of searches for terms commonly associated with AI shows how dramatic this rise was:

ChatGPT came from nowhere to top all the AI search terms except for Machine Learning itself, which is consistently our #3 search term—and, despite ChatGPT’s dramatic decline in June and July, it’s still ahead of all other search terms relevant to AI. The number of searches for Machine Learning itself held steady, though it arguably declined slightly when ChatGPT appeared. What’s more interesting, though, is that the search term “Generative AI” suddenly emerged from the pack as the third most popular search term. If current trends continue, in August we might see more searches for Generative AI than for ChatGPT.

What can we make of this? Everyone knows that ChatGPT had one of the most successful launches of any software project, passing a million users in its first five days. (Since then, it’s been beaten by Facebook’s Threads, though that’s not really a fair comparison.) There are plenty of reasons for this surge. Talking computers have been a science fiction dream since well before Star Trek—by itself, that’s a good reason for the public’s fascination. ChatGPT might simplify common tasks, from doing research to writing essays to basic programming, so many people want to use it to save labor—though getting it to do quality work is more difficult than it seems at first glance. (We’ll leave the issue of whether this is “cheating” to the users, their teachers, and their employers.) And, while I’ve written frequently about how ChatGPT will change programming, it will undoubtedly have an even greater effect on non-programmers. It will give them the chance to tell computers what to do without programming; it’s the ultimate “low code” experience.

So there are plenty of reasons for ChatGPT to surge. What about other search terms? It’s easy to dismiss these search queries as also-rans, but they were all in the top 300 for May, 2023—and we typically have a few million unique search terms per month. Removing ChatGPT and Machine Learning from the previous graph makes it easier to see trends in the other popular search terms:

It’s mostly “up and to the right.” Three search terms stand out: Generative AI, LLM, and Langchain all follow similar curves: they start off with relatively moderate growth that suddenly becomes much steeper in February, 2023. We’ve already noted that the number of searches for Generative AI has increased sharply since the release of ChatGPT, and hasn’t declined in the past two months. Our users evidently prefer LLM to spelling out “Large Language Models,” but if you add these two search terms together, the total number of searches for July is within 1% of Generative AI. This surge didn’t really start until last November, when it was spurred by the appearance of ChatGPT—even though search terms like LLM were already in circulation because of GPT-3, DALL-E, StableDiffusion, Midjourney, and other language-based generative AI tools.

Unlike LLMs, Langchain didn’t exist prior to ChatGPT—but once it appeared, the number of searches took off rapidly, and didn’t decline in June and July. That makes sense; although it’s still early, Langchain looks like it will be the cornerstone of LLM-based software development. It’s a widely used platform for building applications that generate queries programmatically and that connect LLMs with each other, with databases, and with other software. Langchain is frequently used to look up relevant articles that weren’t in ChatGPT’s training data and package them as part of a lengthy prompt.
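
To make that pattern concrete, here’s a toy sketch of retrieval-augmented prompting in plain Python. This is purely illustrative: the function names are ours, and this is not the actual Langchain API.

```python
# Toy sketch of the retrieval-augmentation pattern: find documents the
# model wasn't trained on, then package them into one lengthy prompt.
# All names here are illustrative; this is not the Langchain API.

def retrieve(query, documents, top_k=2):
    """Rank documents by naive word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: -len(q_words & set(d.lower().split())))[:top_k]

def build_prompt(query, documents):
    """Pack the retrieved context plus the question into a single prompt."""
    context = "\n\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

docs = [
    "LLMs can be augmented with documents retrieved at query time.",
    "Polyrhythms layer two conflicting rhythms over one another.",
]
prompt = build_prompt("How can LLMs use retrieved documents?", docs)
print(prompt.splitlines()[0])  # Answer using only this context:
```

A real system would replace the word-overlap ranking with embedding similarity over a vector database, but the shape of the prompt is the same.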

In this group, the only search term that seems to be in a decline is Natural Language Processing. Although large language models clearly fall into the category of NLP, we suspect that most users associate NLP with older approaches to building chatbots. Searches for Artificial Intelligence appear to be holding their own, though it’s surprising that there are so few searches for AI compared to Machine Learning. The difference stems from O’Reilly’s audience, which is relatively technical and prefers the more precise term Machine Learning. Nevertheless, the number of searches for AI rose with the release of ChatGPT, possibly because ChatGPT’s appeal wasn’t limited to the technical community.

Now that we’ve run through the data, we’re left with the big question: What happened to ChatGPT? Why did it decline from roughly 5,000 searches to slightly over 2,500 in a period of two months? There are many possible reasons. Perhaps students stopped using ChatGPT for homework assignments as graduation and summer vacation approached. Perhaps ChatGPT has saturated the world; people know what they need to know, and are waiting for the next blockbuster. An article in Ars Technica notes that ChatGPT usage declined from May to June, and suggests many possible causes, including attention to the Twitter/Threads drama and frustration because OpenAI implemented stricter guardrails to prevent abuse. It would be unfortunate if ChatGPT usage is declining because people can’t use it to generate abusive content, but that’s a different article…

A more important reason for this decline might be that ChatGPT is no longer the only game in town. There are now many alternative language models. Most of these alternatives descend from Meta’s LLaMA and Georgi Gerganov’s llama.cpp (which can run on laptops, cell phones, and even Raspberry Pi). Users can train these models to do whatever they want. Some of these models already have chat interfaces, and all of them could support chat interfaces with some fairly simple programming. None of these alternatives generate significant search traffic at O’Reilly, but that doesn’t mean that they won’t in the future, or that they aren’t an important part of the ecosystem. Their proliferation is an important piece of evidence about what’s happening among O’Reilly’s users. AI developers now need to ask a question that didn’t even exist last November: should they build on large foundation models like ChatGPT or Google’s Bard, using public APIs and paying by the token? Or should they start with an open source model that can run locally and be trained for their specific application?

This last explanation makes a lot of sense in context. We’ve moved beyond the initial phase, when ChatGPT was a fascinating toy. We’re now building applications and incorporating language models into products, so trends in search terms have shifted accordingly. A developer interested in building with large language models needs more context; learning about ChatGPT by itself isn’t enough. Developers who want to learn about language models need different kinds of information, information that’s both deeper and broader. They need to learn about how generative AI works, about new LLMs, about programming with Langchain and other platforms. All of these search terms increased while ChatGPT declined. Now that there are options, and now that everyone has had a chance to try out ChatGPT, the first step in an AI project isn’t to search for ChatGPT. It’s to get a sense of the landscape, to discover the possibilities.

Searches for ChatGPT peaked quickly, and are now declining rapidly—and who knows what August and September will bring? (We wouldn’t be surprised to see ChatGPT bounce back as students return to school and homework assignments.) The real news is that ChatGPT is no longer the whole story: you can’t look at the decline in ChatGPT without also considering what else our users are searching for as they start building AI into other projects. Large language models are very clearly part of the future. They will change the way we work and live, and we’re just at the start of the revolution.


Radar Trends to Watch: August 2023

O'Reilly Radar - Tue, 2023/08/01 - 03:04

Artificial Intelligence continues to dominate the news. In the past month, we’ve seen a number of major updates to language models: Claude 2, with its 100,000 token context limit; LLaMA 2, with (relatively) liberal restrictions on use; and Stable Diffusion XL, a significantly more capable version of Stable Diffusion. Does Claude 2’s huge context really change what the model can do? And what role will open access and open source language models have as commercial applications develop?

Artificial Intelligence
  • Stable Diffusion XL is a new generative model that expands on the abilities of Stable Diffusion. It promises shorter, easier prompts; the ability to generate text within images correctly; the ability to be trained on private data; and of course, higher quality output. Try it on clipdrop.
  • OpenAI has withdrawn OpenAI Classifier, a tool that was supposed to detect AI-generated text, because it was not accurate enough.
  • ChatGPT has added a new feature called “Custom Instructions.”  This feature lets users specify an initial prompt that ChatGPT processes prior to any other user-generated prompts; essentially, it’s a personal “system prompt.” Something to make prompt injection more fun.
  • Qualcomm is working with Facebook/Meta to run LLaMA 2 on small devices like phones, enabling AI applications to run locally. The distinction between open source and other licenses will prove much less important than the size of the machine on which the model runs.
  • StabilityAI has released two new large language models, FreeWilly1 and FreeWilly2. They are based on LLaMA and LLaMA 2 respectively. They are called Open Access (as opposed to Open Source), and claim performance similar to GPT 3.5 for some tasks.
  • Chatbot Arena lets chatbots do battle with each other. Users enter prompts, which are sent to two unnamed (randomly chosen?) language models. After the responses have been generated, users can declare a winner, and find out which models have been competing.
  • GPT-4’s ability to generate correct answers to problems may have degraded over the past few months—in particular, its ability to solve mathematical problems and generate correct Python code seems to have suffered. On the other hand, it is more robust against jailbreaking attacks.
  • Facebook/Meta has released Llama 2. While there are fewer restrictions on its use than other models, it is not open source despite Facebook’s claims.
  • Autochain is a lightweight, simpler alternative to Langchain. It allows developers to build complex applications on top of large language models and databases.
  • Elon Musk has announced his new AI company, xAI. Whether this will actually contribute to AI or be another sideshow is anyone’s guess.
  • Anthropic has announced Claude 2, a new version of their large language model. A chat interface and API access are available. Claude 2 allows prompts of up to 100,000 tokens, much larger than other LLMs, and can generate output up to “a few thousand tokens” in length.
  • parsel is a framework that helps large language models do a better job on tasks involving hierarchical multi-step reasoning and problem solving.
  • gpt-prompt-engineer is a tool that reads a description of the task you want an AI to perform, plus a number of test cases. It then generates a large number of prompts about a topic, tests the prompts, and rates the results.
  • LlamaIndex is a data framework (sometimes called an “orchestration framework”) for language models that simplifies the process of indexing a user’s data and using that data to build complex prompts for language models. It can be used with Langchain to build complex AI applications.
  • OpenAI is gradually releasing its Code Interpreter, which will allow ChatGPT to execute any code that it creates, using data provided by the user, and sending output back to the user. Code Interpreter reduces hallucinations, errors, and bad math.
  • Humans can now beat AI at Go by finding and exploiting weaknesses in the AI system’s play, tricking the AI into making serious mistakes.
  • Time for existential questions: Does a single banana exist? Midjourney doesn’t think so. Seriously, this is an excellent article about the difficulty of designing prompts that deliver appropriate results.
  • The Jolly Roger Telephone Company has developed GPT–4-based voicebots that you can hire to answer your phone when telemarketers call. If you want to listen in, the results can be hilarious.
  • Apache Spark now has an English SDK. It goes a step beyond tools like Copilot, allowing you to use English directly when writing code.
  • Humans may be more likely to believe misinformation generated by AI, possibly because AI-generated text is better structured than most human text. Or maybe because AIs are very good at being convincing.
  • OpenOrca is yet another LLaMA-based open source language model and dataset. Its goal is to reproduce the training data for Microsoft’s Orca, which was trained using chain-of-thought prompts and responses from GPT-4. The claim for both Orca models is that they can reproduce GPT-4’s “reasoning” processes.
  • At its developer summit, Snowflake announced Document AI: natural language queries of collections of unstructured documents. This product is based on their own large language model, not one from an AI provider.
  • “It works on my machine” has become “It works in my container”: This article has some good suggestions about how to avoid a problem that has plagued computer users for decades.
  • StackOverflow is integrating AI into its products. StackOverflow for Teams now has a chatbot to help solve technical problems, along with a new GenAI StackExchange for discussing generative AI, prompt writing, and related issues.
  • It isn’t news that GitHub can leak private keys and authentication secrets. But a study of the containers available on DockerHub shows that Docker containers also leak keys and secrets, and many of these keys are in active use.
  • Firejail is a Linux tool that can run any process in a private, secure sandbox.
  • Complex and complicated: what’s the difference? It has to do with information, and it’s important to understand in an era of “complex systems.” First in a series.
  • npm-manifest-check is a tool that checks the contents of a package in NPM against the package’s manifest. It is a partial solution to the problem of malicious packages in NPM.
  • Facebook has described their software development platform, much of which they have open sourced. Few developers have to work with software projects this large, but their tools (which include testing frameworks, version control, and a build system) are worth investigating.
  • Polyrhythmix is a command-line program for generating polyrhythmic drum parts. No AI involved.
  • Philip Guo’s “Real-Real-World Programming with ChatGPT” shows what it’s like to use ChatGPT to do a real programming task: what works well, what doesn’t.
  • A research group has found a way to automatically generate attack strings that force large language models to generate harmful content. These attacks work against both open- and closed-source models. It isn’t clear that AI providers can defend against them.
  • The cybercrime syndicate Lazarus Group is running a social engineering attack against JavaScript cryptocurrency developers. Developers are invited to collaborate on a Github project that depends on malicious NPM packages.
  • Language models are the next big thing in cybercrime. A large language model called WormGPT has been developed for use by cybercriminals. It is based on GPT-J. WormGPT is available on the dark web along with thousands of stolen ChatGPT credentials.
  • According to research by MITRE, out-of-bounds writes are among the most dangerous security bugs. They are also the most common, and are consistently at the top of the list. An easy solution to the problem is to use Rust.
  • Another web framework? Enhance claims to be HTML-first, with JavaScript only if you need it. The reality may not be that simple, but if nothing else, it’s evidence of growing dissatisfaction with complex and bloated web applications.
  • Another new browser? Arc rethinks the browsing experience with the ability to switch between groups of tabs and customize individual websites.
  • HTMX provides a way of using HTML attributes to build many advanced web page features, including WebSockets and what we used to call Ajax. All the complexity appears to be packaged into one JavaScript library.
  • There is a law office in the Metaverse, along with a fledgling Metaverse Bar Association. It’s a good place for meetings, although lawyers cannot be licensed to practice in the Metaverse.
  • The European Court of Justice (CJEU) has ruled that Meta’s approach to GDPR compliance is illegal. Meta may not use data for anything other than core functionality without explicit, freely-given consent; consent hidden in the terms of use document does not suffice.
  • Google has updated its policy on Android apps to allow apps to include blockchain-based assets such as NFTs.
  • ChatGPT can be programmed to send Bitcoin payments. As the first commenter points out, this is a fairly simple application of Langchain, but it’s something that was certainly going to happen. It also raises the question: when will we have GPT-based cryptocurrency arbitrage?
  • Google has developed Med-PaLM M, an attempt at building a “generalist” multimodal AI that has been trained for biomedical applications. Med-PaLM M is still a research project, but may represent a step forward in the application of large language models to medicine.
  • Room temperature ambient pressure superconductors: This claim has met with a lot of skepticism—but as always, it’s best to wait until another team succeeds or fails to duplicate the results. If this research holds up, it’s a huge step forward.

Real-Real-World Programming with ChatGPT

O'Reilly Radar - Tue, 2023/07/25 - 03:49

If you’re reading this, chances are you’ve played around with using AI tools like ChatGPT or GitHub Copilot to write code for you. Or even if you haven’t yet, then you’ve at least heard about these tools in your newsfeed over the past year. So far I’ve read a gazillion blog posts about people’s experiences with these AI coding assistance tools. These posts often recount someone trying ChatGPT or Copilot for the first time with a few simple prompts, seeing how it does for some small self-contained coding tasks, and then making sweeping claims like “WOW this exceeded all my highest hopes and wildest dreams, it’s going to replace all programmers in five years!” or “ha look how incompetent it is … it couldn’t even get my simple question right!”

I really wanted to go beyond these quick gut reactions that I’ve seen so much of online, so I tried using ChatGPT for a few weeks to help me implement a hobby software project and took notes on what I found interesting. This article summarizes what I learned from that experience. The inspiration (and title) for it comes from Mike Loukides’ Radar article on Real World Programming with ChatGPT, which shares a similar spirit of digging into the potential and limits of AI tools for more realistic end-to-end programming tasks.

Setting the Stage: Who Am I and What Am I Trying to Build?

I’m a professor who is interested in how we can use LLMs (Large Language Models) to teach programming. My student and I recently published a research paper on this topic, which we summarized in our Radar article Teaching Programming in the Age of ChatGPT. Our paper reinforces the growing consensus that LLM-based AI tools such as ChatGPT and GitHub Copilot can now solve many of the small self-contained programming problems that are found in introductory classes. For instance, problems like “write a Python function that takes a list of names, splits them by first and last name, and sorts by last name.” It’s well-known that current AI tools can solve these kinds of problems even better than many students can. But there’s a huge difference between AI writing self-contained functions like these and building a real piece of software end-to-end. I was curious to see how well AI could help students do the latter, so I wanted to first try doing it myself.

I needed a concrete project to implement with the help of AI, so I decided to go with an idea that had been in the back of my head for a while now: Since I read a lot of research papers for my job, I often have multiple browser tabs open with the PDFs of papers I’m planning to read. I thought it would be cool to play music from the year that each paper was written while I was reading it, which provides era-appropriate background music to accompany each paper. For instance, if I’m reading a paper from 2019, a popular song from that year could start playing. And if I switch tabs to view a paper from 2008, then a song from 2008 could start up. To provide some coherence to the music, I decided to use Taylor Swift songs since her discography covers the time span of most papers that I typically read: Her main albums were released in 2006, 2008, 2010, 2012, 2014, 2017, 2019, 2020, and 2022. This choice also inspired me to call my project Swift Papers.

Swift Papers felt like a well-scoped project to test how well AI handles a realistic yet manageable real-world programming task. Here’s how I worked on it: I subscribed to ChatGPT Plus and used the GPT-4 model in ChatGPT (first the May 12, 2023 version, then the May 24 version) to help me with design and implementation. I also installed the latest VS Code (Visual Studio Code) with GitHub Copilot and the experimental Copilot Chat plugins, but I ended up not using them much. I found it easier to keep a single conversational flow within ChatGPT rather than switching between multiple tools. Lastly, I tried not to search for help on Google, Stack Overflow, or other websites, which is what I would normally be doing while programming. In sum, this is me trying to simulate the experience of relying as much as possible on ChatGPT to get this project done.

Getting Started: Setup Trials and Tribulations

Here’s the exact prompt I used to start my conversation with ChatGPT using GPT-4:

Act as a software developer to help me build something that will play music from a time period that matches when an academic paper I am reading in the browser was written.

I purposely kept this prompt high-level and underspecified since I wanted ChatGPT to guide me toward design and implementation ideas without me coming in with preconceived notions.

ChatGPT immediately suggested a promising direction—making a browser extension that gets the date of the research paper PDF in the currently-active tab and calls a music streaming API to play a song from that time period. Since I already had a YouTube Music account, I asked whether I could use it, but ChatGPT said that YouTube Music doesn’t have an API. We then brainstormed alternative ideas like using a browser automation tool to programmatically navigate and click on parts of the YouTube Music webpage. ChatGPT gave me some ideas along these lines but warned me that, “It’s important to note that while this approach doesn’t use any official APIs, it’s more brittle and more subject to break if YouTube Music changes their website structure. […] keep in mind that web scraping and browser automation can be complex, and handling all of the edge cases can be a significant amount of work. […] using APIs might be a more reliable and manageable solution.” That warning convinced me to drop this idea. I recalled that ChatGPT had recommended the Spotify Web API in an earlier response, so I asked it to teach me more about what it can do and tell me why I should use it rather than YouTube Music. It seemed like Spotify had what I needed, so I decided to go with it. I liked how ChatGPT helped me work through the tradeoffs of these initial design decisions before diving head-first into coding.

Next we worked together to set up the boilerplate code for a Chrome browser extension, which I’ve never made before. ChatGPT started by generating a manifest.json file for me, which holds the configuration settings that every Chrome extension needs. I didn’t know it at the time, but manifest.json would cause me a bunch of frustration later on. Specifically:

  • ChatGPT generated a manifest.json file in the old Version 2 (v2) format, which is unsupported in the current version of Chrome. For a few years now Google has been transitioning developers to v3, which I didn’t know about since I had no prior experience with Chrome extensions. And ChatGPT didn’t warn me about this. I guessed that maybe ChatGPT only knew about v2 since it was trained on open-source code from before September 2021 (its knowledge cutoff date) and v2 was the dominant format before that date. When I tried loading the v2 manifest.json file into Chrome and saw the error message, I told ChatGPT “Google says that manifest version 2 is deprecated and to upgrade to version 3.” To my surprise, it knew about v3 from its training data and generated a v3 manifest file for me in response. It even told me that v3 is the currently-supported version (not v2!) … yet it still defaulted to v2 without giving me any warning! This frustrated me even more than if ChatGPT had not known about v3 in the first place (in that case I wouldn’t blame it for not telling me something that it clearly didn’t know). This theme of sub-optimal defaults will come up repeatedly—that is, ChatGPT ‘knows’ what the optimal choice is but won’t generate it for me without me asking for it. The dilemma is that someone like me who is new to this area wouldn’t even know what to ask for in the first place.
  • After I got the v3 manifest working in Chrome, as I tried using ChatGPT to help me add more details to my manifest.json file, it tended to “drift” back to generating code in v2 format. I had to tell it a few times to only generate v3 code from now on, and I still didn’t fully trust it to follow my directive. Besides generating code for v2 manifest files, it also generated starter JavaScript code for my Chrome extension that works only with v2 instead of v3, which led to more mysterious errors. If I were to start over knowing what I do now, my initial prompt would have sternly told ChatGPT that I wanted to make an extension using v3, which would hopefully avoid it leading me down this v2 rabbit hole.
  • The manifest file that ChatGPT generated for me declared the minimal set of permissions—it only listed the activeTab permission, which grants the extension limited access to the active browser tab. While this has the benefit of respecting user privacy by minimizing permissions (which is a best practice that ChatGPT may have learned from its training data), it made my coding efforts a lot more painful since I kept running into unexpected errors when I tried adding new functionality to my Chrome extension. Those errors often showed up as something not working as intended, but Chrome wouldn’t necessarily display a permission denied message. In the end, I had to add four additional permissions—“tabs”, “storage”, “scripting”, “identity”—as well as a separate “host_permissions” field to my manifest.json.
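For concreteness, here’s roughly what my manifest.json ended up looking like by the end (a reconstruction from memory rather than the verbatim file—the matched sites, script names, and version string are illustrative):

```json
{
  "manifest_version": 3,
  "name": "Swift Papers",
  "version": "1.0",
  "permissions": ["activeTab", "tabs", "storage", "scripting", "identity"],
  "host_permissions": ["https://dl.acm.org/*", "https://arxiv.org/*"],
  "background": { "service_worker": "background.js" },
  "content_scripts": [
    {
      "matches": ["https://dl.acm.org/*", "https://arxiv.org/*"],
      "js": ["content.js"]
    }
  ],
  "action": { "default_popup": "popup.html" }
}
```

Note how little of this resembles the v2 format ChatGPT kept defaulting to: v3 replaces a persistent background page with a `service_worker`, and host permissions live in their own `host_permissions` field instead of inside `permissions`.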

Wrestling with all these finicky details of manifest.json before I could begin any real coding felt like death by a thousand cuts. In addition, ChatGPT generated other starter code in the chat, which I copied into new files in my VS Code project:

Intermission 1: ChatGPT as a Personalized Tutor

As shown above, a typical Chrome extension like mine has at least three JavaScript files: a background script, a content script, and a pop-up script. At this point I wanted to learn more about what all these files are meant to do rather than continuing to obediently copy-paste code from ChatGPT into my project. Specifically, I discovered that each file has different permissions for what browser or page components it can access, so all three must coordinate to make the extension work as intended. Normally I would read tutorials about how this all fits together, but the problem with tutorials is that they are not customized to my specific use case. Tutorials provide generic conceptual explanations and use made-up toy examples that I can’t relate to. So I end up needing to figure out how their explanations may or may not apply to my own context.

In contrast, ChatGPT can generate personalized tutorials that use my own Swift Papers project as the example in its explanations! For instance, when it explained to me what a content script does, it added that “For your specific project, a content script would be used to extract information (the publication date) from the academic paper’s webpage. The content script can access the DOM of the webpage, find the element that contains the publication date, and retrieve the date.” Similarly, it taught me that “Background scripts are ideal for handling long-term or ongoing tasks, managing state, maintaining databases, and communicating with remote servers. In your project, the background script could be responsible for communicating with the music API, controlling the music playback, and storing any data or settings that need to persist between browsing sessions.”
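To make that explanation concrete, the content-script half might look something like the sketch below. The “Published:” pattern and the message shape are my own assumptions for illustration, not ChatGPT’s actual output:

```javascript
// content.js (sketch): runs inside the paper's webpage, extracts the
// publication year, and hands it to the background script.

function extractPublicationYear(pageText) {
  // Find a four-digit year in a "Published: 01 June 1992"-style string.
  const m = pageText.match(/Published:[^]*?\b((?:19|20)\d{2})\b/);
  return m ? Number(m[1]) : null;
}

// In the real extension this reads the page DOM and messages
// background.js; guarded here so the sketch also runs outside Chrome.
if (typeof chrome !== "undefined" && chrome.runtime) {
  const year = extractPublicationYear(document.body.innerText);
  chrome.runtime.sendMessage({ type: "paperYear", year });
}
```

The background script would then listen for that message (via chrome.runtime.onMessage) and decide what song to play, since only it has the permissions to talk to remote servers and manage long-lived state.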

I kept asking ChatGPT follow-up questions to get it to teach me more nuances about how Chrome extensions worked, and it grounded its explanations in how those concepts applied to my Swift Papers project. To accompany its explanations, it also generated relevant example code that I could try out by running my extension. These explanations clicked well in my head because I was already deep into working on Swift Papers. It was a much better learning experience than, say, reading generic getting-started tutorials that walk through creating example extensions like “track your page reading time” or “remove clutter from a webpage” or “manage your tabs better” … I couldn’t bring myself to care about those examples since THEY WEREN’T RELEVANT TO ME! At the time, I cared only about how these concepts applied to my own project, so ChatGPT shined here by generating personalized mini-tutorials on-demand.

Another great side-effect of ChatGPT teaching me these concepts directly within our ongoing chat conversation is that whenever I went back to work on Swift Papers after a few days away from it, I could scroll back up in the chat history to review what I recently learned. This reinforced the knowledge in my head and got me back into the context of resuming where I last left off. To me, this is a huge benefit of a conversational interface like ChatGPT versus an IDE autocomplete interface like GitHub Copilot, which doesn’t leave a trace of its interaction history. Even though I had Copilot installed in VS Code as I was working on Swift Papers, I rarely used it (beyond simple autocompletions) since I liked having a chat history in ChatGPT to refer back to in later sessions.

Next Up: Choosing and Installing a Date Parsing Library

Ideally Swift Papers would infer the date when an academic paper was written by analyzing its PDF file, but that seemed too hard to do since there isn’t a standard place within a PDF where the publication date is listed. Instead what I decided to do was to parse the “landing pages” for each paper that contains metadata such as its title, abstract, and publication date. Many papers I read are linked from a small handful of websites, such as the ACM Digital Library, arXiv, or Google Scholar, so I could parse the HTML of those landing pages to extract publication dates. For instance, here’s the landing page for the classic Beyond being there paper:

I wanted to parse the “Published: 01 June 1992” string on that page to get 1992 as the publication year. I could’ve written this code by hand, but I wanted to try using a JavaScript date parsing library since it would be more robust to date format variations that appear on various websites (e.g., using “22” for the year 2022). Also, since any real-world software project will need to use external libraries, I wanted to see how well ChatGPT could help me choose and install libraries.

ChatGPT suggested two libraries for me: Moment.js and chrono-node. However, it warned me about Moment.js: “as of September 2020, it is considered a legacy project and not recommended for new projects as the team is not planning on doing any new development or maintenance.” I verified this was true by seeing the same warning on the Moment.js homepage. But still, I liked how Moment.js was available as a single self-contained file that I could directly include into my extension without using a package manager like npm or a bundler like webpack (the fewer external tools I needed to set up, the better!). Or so I thought … ChatGPT led me to believe that I could get by without npm and webpack, but later I discovered that this only works in the old Manifest v2 extension format that is no longer supported by Chrome. It turns out that with Manifest v3, all external libraries must be bundled together using tools like npm and webpack in order for the extension to import them. So it goes… once more I got bitten by the fact that ChatGPT was biased toward producing outdated information for v2 rather than the newer v3, presumably because there was a lot more information about v2 in its training data.

Since I had to use npm and webpack anyways, I decided to go with chrono-node since it seemed more robust and up-to-date (no pun intended). I had ChatGPT help me set up my webpack configuration file (webpack.config.js) and things almost seemed to work, except that I got a mysterious error. When I pasted the error message into ChatGPT, it correctly diagnosed the problem, which was something related to (surprise surprise!) webpack’s default settings not being compatible with the security restrictions of my extension’s Manifest v3 format. It also suggested a working fix to webpack.config.js:

This error message is related to the Content Security Policy (CSP) of Chrome extensions. […] By default, Manifest V3 disallows the use of eval() and the Function() constructor, which are typically used for evaluating or creating code from strings […] Webpack, by default, includes a small amount of inline JavaScript code for handling hot module replacement and some other features. This code uses eval() which is not allowed in the context of a Chrome extension. To solve this, you can adjust your webpack configuration to output code in a format that doesn’t rely on eval(). In your webpack.config.js, you can set the devtool option to ‘none’ or use the ‘source-map’ setting […]

Here again ChatGPT showed me that it clearly knew what the problem was (since it told me after I fed it the error message!) and how to fix it. So why didn’t it produce the correct webpack configuration file in the first place?

More generally, several times I’ve seen ChatGPT produce code that I felt might be incorrect. Then when I tell it that there might be a bug in a certain part, it admits its mistake and produces the correct code in response. If it knew that its original code was incorrect, then why didn’t it generate the correct code in the first place?!? Why did I have to ask it to clarify before it admitted its mistake? I’m not an expert at how LLMs work internally, but my layperson guess is that it may have to do with the fact that ChatGPT generates code linearly one token at a time, so it may get ‘stuck’ near local maxima (with code that mostly works but is incorrect in some way) while it is navigating the enormous abstract space of possible output code tokens; and it can’t easily backtrack to correct itself as it generates code in a one-way linear stream. But after it finishes generating code, when the user asks it to review that code for possible errors, it can now “see” and analyze all of that code at once. This comprehensive view of the code may enable ChatGPT to find bugs better, even if it couldn’t avoid introducing those bugs in the first place due to how it incrementally generates code in a one-way stream. (This isn’t an accurate technical explanation, but it’s how I informally think about it.)
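Before moving on: once the bundling finally worked, the chrono-node call itself was the easy part. Here’s a sketch (the wrapper shape is mine, not ChatGPT’s output; it takes the parse function as a parameter so the sketch can be exercised even without the library installed):

```javascript
// Sketch of extracting a publication year with a date parsing library
// (assumes `npm install chrono-node` for real use).

function yearFromText(parseDate, text) {
  const d = parseDate(text); // chrono-node's parseDate returns null on failure
  return d && !isNaN(d) ? d.getFullYear() : null;
}

// With the real library:
//   const chrono = require("chrono-node");
//   yearFromText(chrono.parseDate, "Published: 01 June 1992");
```

The point of using a library rather than a hand-rolled regex is exactly the robustness mentioned earlier: a parser like chrono-node tolerates the many date format variations that show up across different landing pages.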

Intermission 2: ChatGPT as a UX Design Consultant

Now that I had a basic Chrome extension that could extract paper publication dates from webpages, the next challenge was using the Spotify API to play era-appropriate Taylor Swift songs to accompany these papers. But before embarking on another coding-intensive adventure, I wanted to switch gears and think more about UX (user experience). I got so caught up in the first few hours of getting my extension set up that I hadn’t thought about how this app ought to work in detail. What I needed at this time was a UX design consultant, so I wanted to see if ChatGPT could play this role.

Note that up until now I had been doing everything in one long-running chat session that focused on coding-related questions. That was great because ChatGPT was fully “in the zone” and had a very long conversation (spanning several hours over multiple days) to use as context for generating code suggestions and technical explanations. But I didn’t want all that prior context to influence our UX discussion, so I decided to begin again by starting a brand-new session with the following prompt:

You are a Ph.D. graduate in Human-Computer Interaction and now a senior UX (user experience) designer at a top design firm. Thus, you are very familiar with both the experience of reading academic papers in academia and also designing amazing user experiences in digital products such as web applications. I am a professor who is creating a Chrome Extension for fun in order to prototype the following idea: I want to make the experience of reading academic papers more immersive by automatically playing Taylor Swift songs from the time period when each paper was written while the reader is reading that particular paper in Chrome. I have already set up all the code to connect to the Spotify Web API to programmatically play Taylor Swift songs from certain time periods. I have also already set up a basic Chrome Extension that knows what webpages the user has open in each tab and, if it detects that a webpage may contain metadata about an academic paper then it parses that webpage to get the year the paper was written in, in order to tell the extension what song to play from Spotify. That is the basic premise of my project.

Your job is to serve as a UX design consultant to help me design the user experience for such a Chrome Extension. Do not worry about whether it is feasible to implement the designs. I am an experienced programmer so I will tell you what ideas are or are not feasible to implement. I just want your help with thinking through UX design.

As our session progressed, I was very impressed with ChatGPT’s ability to help me brainstorm how to handle different user interaction scenarios. That said, I had to give it some guidance upfront using my knowledge of UX design: I started by asking it to come up with a few user personas and then to build up some user journeys for each. Given this initial prompting, ChatGPT was able to help me come up with practical ideas that I hadn’t originally considered, especially for handling unusual edge cases (e.g., what should happen to the music when the user switches between tabs very quickly?). The back-and-forth conversational nature of our chat made me feel like I was talking to a real human UX design consultant.

I had a lot of fun working with ChatGPT to refine my initial high-level ideas into a detailed plan for how to handle specific user interactions within Swift Papers. The culmination of our consulting session was ChatGPT generating ASCII diagrams of user journeys through Swift Papers, which I could later refer to when implementing this logic in code. Here’s one example:

Reflecting back, this session was productive because I was familiar enough with UX design concepts to steer the conversation towards more depth. Out of curiosity, I started a new chat session with exactly the same UX consultant prompt as above but then played the part of a total novice instead of guiding it:

I don’t know anything about UX design. Can you help me get started since you are the expert?

The conversation that followed this prompt was far less useful since ChatGPT ended up giving me a basic primer on UX Design 101 and offering high-level suggestions for how I can start thinking about the user experience of Swift Papers. I didn’t want to nudge it too hard since I was pretending to be a novice, and it wasn’t proactive enough to ask me clarifying questions to probe deeper. Perhaps if I had prompted it to be more proactive at the start, then it could have elicited more information even from a novice.

This digression reinforces the widely-known consensus that what you get out of LLMs like ChatGPT is only as good as the prompts you’re able to put in. There’s all of this relevant knowledge hiding inside its neural network mastermind of billions and billions of LLM parameters, but it’s up to you to coax it into revealing what it knows by taking the lead in conversations and crafting the right prompts to direct it toward useful responses. Doing so requires a degree of expertise in the domain you’re asking about, so it’s something that beginners would likely struggle with.

The Last Big Hurdle: Working with the Spotify API

After ChatGPT helped me with UX design, the last hurdle I had to overcome was figuring out how to connect my Chrome extension to the Spotify Web API to select and play music. Like my earlier adventure with installing a date parsing library, connecting to web APIs is another common real-world programming task, so I wanted to see how well ChatGPT could help me with it.

The gold standard here is an expert human programmer who has a lot of experience with the Spotify API and who is good at teaching novices. ChatGPT was alright for getting me started but ultimately didn’t meet this standard. My experience here showed me that human experts still outperform the current version of ChatGPT along the following dimensions:

  • Context, context, context: Since ChatGPT can’t “see” my screen, it lacks a lot of useful task context that a human expert sitting beside me would have. For instance, connecting to a web API requires a lot of “pointing-and-clicking” manual setup work that isn’t programming: I had to register for a paid Spotify Premium account to grant me API access, navigate through its web dashboard interface to create a new project, generate API keys and insert them into various places in my code, then register a URL where my app lives in order for authentication to work. But what URL do I use? Swift Papers is a Chrome extension running locally on my computer rather than online, so it doesn’t have a real URL. I later discovered that Chrome extensions export a fake URL that can be used for web API authentication. A human expert who is pair programming with me would know all these ultra-specific idiosyncrasies and guide me through pointing-and-clicking on the various dashboards to put all the API keys and URLs in the right places. In contrast, since ChatGPT can’t see this context, I have to explicitly tell it what I want at each step. And since this setup process was so new to me, I had a hard time thinking about how to phrase my questions. A human expert would be able to see me struggling and step in to offer proactive assistance for getting me unstuck.
  • Bird’s-eye view: A human expert would also understand what I’m trying to do—selecting and playing date-appropriate songs—and guide me on how to navigate the labyrinth of the sprawling Spotify API in order to do it. In contrast, ChatGPT doesn’t seem to have as much of a bird’s-eye view, so it eagerly barrels ahead to generate code with specific low-level API calls whenever I ask it something. I, too, am eager to follow its lead since it sounds so confident each time it suggests code along with a convincing explanation (LLMs tend to adopt an overconfident tone, even if their responses may be factually inaccurate). That sometimes leads me on a wild goose chase down one direction only to realize that it’s a dead-end and that I have to backtrack. More generally, it seems hard for novices to learn programming in this piecemeal way by churning through one ChatGPT response after another rather than having more structured guidance from a human expert.
  • Tacit (unwritten) knowledge: The Spotify API is meant to control an already-open Spotify player (e.g., the web player or a dedicated app), not to directly play songs. Thus, ChatGPT told me it was not possible to use it to play songs in the current browser tab, which Swift Papers needed to do. I wanted to verify this for myself, so I went back to “old-school” searching the web, reading docs, and looking for example code online. I found that there was conflicting and unreliable information about whether it’s even possible to do this. And since ChatGPT is trained on text from the internet, if that text doesn’t contain high-quality information about a topic, then ChatGPT won’t work well for it either. In contrast, a human expert can draw upon their vast store of experience from working with the Spotify API in order to teach me tricks that aren’t well-documented online. In this case, I eventually figured out a hack to get playback working by forcing a Spotify web player to open in a new browser tab, using a super-obscure and not-well-documented API call to make that player ‘active’ (or else it sometimes won’t respond to requests to play … that took me forever to figure out, and ChatGPT kept giving me inconsistent responses that didn’t work), and then playing music within that background tab. I feel that humans are still better than LLMs at coming up with these sorts of hacks since there aren’t readily-available online resources to document them. A lot of this hard-earned knowledge is tacit and not written down anywhere, so LLMs can’t be trained on it.
  • Lookahead: Lastly, even in instances when ChatGPT could help out by generating good-quality code, I often had to manually update other source code files to make them compatible with the new code that ChatGPT was giving me. For instance, when it suggested an update to a JavaScript file to call a specific Chrome extension API function, I also had to modify my manifest.json to grant an additional permission before that function call could work (bitten by permissions again!). If I didn’t know to do that, then I would see some mysterious error message pop up, paste it into ChatGPT, and it would sometimes give me a way to fix it. Just like earlier, ChatGPT “knows” the answer here, but I must ask it the right question at every step along the way, which can get exhausting. This is especially a problem for novices since we often don’t know what we don’t know, so we don’t know what to even ask for in the first place! In contrast, a human expert who is helping me would be able to “look ahead” a few steps based on their experience and tell me what other files I need to edit ahead of time so I don’t get bitten by these bugs in the first place.

In the end I got this Spotify API setup working by doing some old-fashioned web searching to supplement my ChatGPT conversation. (I did try the ChatGPT + Bing web search plugin for a bit, but it was slow and didn’t produce useful results, so I couldn’t tolerate it any more and just shut it off.) The breakthrough came as I was browsing a GitHub repository of Spotify Web API example code. I saw an example for Node.js that seemed to do what I wanted, so I copy-pasted that code snippet into ChatGPT and told it to adapt the example for my Swift Papers app (which isn’t using Node.js):

Here’s some example code using Implicit Grant Flow from Spotify’s documentation, which is for a Node.js app. Can you adapt it to fit my chrome extension? [I pasted the code snippet here]

ChatGPT did a good job at “translating” that example into my context, which was exactly what I needed at the moment to get unstuck. The code it generated wasn’t perfect, but it was enough to start me down a promising path that would eventually lead me to get the Spotify API working for Swift Papers. Reflecting back, I later realized that I had manually done a simple form of RAG (Retrieval Augmented Generation) here by using my intuition to retrieve a small but highly-relevant snippet of example code from the vast universe of all code on the internet and then asking a super-specific question about it. (However, I’m not sure a beginner would be able to scour the web to find such a relevant piece of example code like I did, so they would probably still be stuck at this step because ChatGPT alone wasn’t able to generate working code without this extra push from me.)
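To give a flavor of what that adaptation involved, the heart of the implicit grant flow in an extension is building Spotify’s authorize URL and handing it to chrome.identity. Here’s a rough sketch along those lines (the client ID is a placeholder, the scopes are my guess at what playback control needs, and error handling is omitted—this is not my actual code):

```javascript
// Sketch of Spotify's Implicit Grant Flow from inside a Chrome extension.

function buildSpotifyAuthUrl(clientId, redirectUri) {
  const params = new URLSearchParams({
    client_id: clientId,
    response_type: "token", // implicit grant returns the token in the URL fragment
    redirect_uri: redirectUri,
    scope: "user-modify-playback-state user-read-playback-state",
  });
  return "https://accounts.spotify.com/authorize?" + params.toString();
}

// chrome.identity supplies the "fake URL" mentioned earlier that
// Spotify redirects back to; guarded so the sketch runs outside Chrome.
if (typeof chrome !== "undefined" && chrome.identity) {
  const redirectUri = chrome.identity.getRedirectURL(); // https://<extension-id>.chromiumapp.org/
  chrome.identity.launchWebAuthFlow(
    { url: buildSpotifyAuthUrl("YOUR_CLIENT_ID", redirectUri), interactive: true },
    (responseUrl) => {
      // The access token comes back in the URL fragment after "#".
      const token = new URLSearchParams(responseUrl.split("#")[1]).get("access_token");
      console.log("Spotify token:", token);
    }
  );
}
```

The redirect URI is the piece that tripped me up: chrome.identity.getRedirectURL() is what produces that fake `chromiumapp.org` address, which must also be registered in the Spotify developer dashboard for authentication to succeed.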

Epilogue: What Now?

I have a confession: I didn’t end up finishing Swift Papers. Since this was a hobby project, I stopped working on it after about two weeks when my day-job got more busy. However, I still felt like I completed the initial hard parts and got a sense of how ChatGPT could (and couldn’t) help me along the way. To recap, this involved:

  • Setting up a basic Chrome extension and familiarizing myself with the concepts, permission settings, configuration files, and code components that must coordinate together to make it all work.
  • Installing third-party JavaScript libraries (such as a date parsing library) and configuring the npm and webpack toolchain so that these libraries work with Chrome extensions, especially given the strict security policies of Manifest v3.
  • Connecting to the Spotify Web API in such a way to support the kinds of user interactions that I needed in Swift Papers and dealing with the idiosyncrasies of accessing this API via a Chrome extension.
  • Sketching out detailed UX journeys for the kinds of user interactions to support and how Swift Papers can handle various edge cases.

After laying this groundwork, I was able to start getting into the flow of an edit-run-debug cycle where I knew exactly where to add code to implement a new feature, how to run it to assess whether it did what I intended, and how to debug. So even though I stopped working on this project due to lack of time, I got far enough to see how completing Swift Papers would be “just a matter of programming.” Note that I’m not trying to trivialize the challenges involved in programming, since I’ve done enough of it to know that the devil is in the details. But these coding-specific details are exactly where AI tools like ChatGPT and GitHub Copilot shine! So even if I had continued adding features throughout the coming weeks, I don’t feel like I would’ve gotten any insights about AI tools that differ from what many others have already written about. That’s because once the software environment has been set up (e.g., libraries, frameworks, build systems, permissions, API authentication keys, and other plumbing to hook things together), then the task at hand reduces to a self-contained and well-defined programming problem, which AI tools excel at.

In sum, my goal in writing this article was to share my experiences using ChatGPT for the more open-ended tasks that came before my project turned into “just a matter of programming.” Now, some may argue that this isn’t “real” programming since it feels like just a bunch of mundane setup and configuration work. But I believe that if “real-world” programming means creating something realistic with code, then “real-real-world” programming (the title of this article!) encompasses all these tedious and idiosyncratic errands that are necessary before any real programming can begin. And from what I’ve experienced so far, this sort of work isn’t something humans can fully outsource to AI tools yet. Long story short, someone today can’t just give AI a high-level description of Swift Papers and have a robust piece of software magically pop out the other end. I’m sure people are now working on the next generation of AI that can bring us closer to this goal (e.g., much longer context windows with Claude 2 and retrieval augmented generation with Cody), so I’m excited to see what’s in store. Perhaps future AI tool developers could use Swift Papers as a benchmark to assess how well their tool performs on an example real-real-world programming task. Right now, widely-used benchmarks for AI code generation (e.g., HumanEval, MBPP) consist of small self-contained tasks that appear in introductory classes, coding interviews, or programming competitions. We need more end-to-end, real-world benchmarks to drive improvements in these AI tools.

Lastly, switching gears a bit, I also want to think more in the future about how AI tools can teach novices the skills they need to create realistic software projects like Swift Papers rather than doing all the implementation work for them. At present, ChatGPT and Copilot are reasonably good “doers” but not nearly as good at being teachers. This is unsurprising since they were designed to carry out instructions like a good assistant would, not to be an effective teacher who provides pedagogically-meaningful guidance. With the proper prompting and fine-tuning, I’m sure they can do much better here, and organizations like Khan Academy are already customizing GPT-4 to become a personalized tutor. I’m excited to see how things progress in this fast-moving space in the coming months and years. In the meantime, for more thoughts about AI coding tools in education, check out this other recent Radar article that I co-authored, Teaching Programming in the Age of ChatGPT, which summarizes our research paper about this topic.
