I started work in 1989, in an office in Central London. I was a graduate trainee, and shared an office with a few secretaries. Not executive assistants - secretaries, whose job it was to take tapes recorded by the management team, and turn those into typed-up documents. Every morning, they’d get envelopes in the internal mail with dozens of tapes, and spend the day typing them up. Some used WANG word processors (dedicated computers that only ran word processing software); most used electric typewriters. The head of the typing pool would spell check important documents; she was formidable, and would often suggest grammatical improvements to “tighten it up a bit”. For the typists using electric typewriters, that meant starting over again…

We had an internal mail department, who came round twice a day to pick up and deliver envelopes for distribution within the company. For urgent messages between offices, you could ask for a fax to be sent - but only if you had a cost approval code, because faxes were expensive. Some people had direct phone lines - very much a status symbol - but most of us received calls via a central switchboard, staffed by the receptionists. Our sales team had lots of sales administrators, whose job was to take orders over the phone and make sure they were processed properly - initially, that meant filling out paper forms for the fulfilment teams; later, and partly as a result of the work I did, that meant creating orders on “the computer”, a mainframe system running our own, home-built ERP system.

I joined the IT department as a trainee; I learned COBOL, but gravitated towards a newfangled concept called the relational database, along with its query language, SQL. The IT department’s job was to write software for the mainframe - we wrote payroll, order processing, inventory management and accounting software, gradually replacing the paper systems that had run the business historically.

None of those jobs exist anymore - at least not in the same form. Typists, WANG operators, internal mailrooms, switchboard operators - I haven’t seen them for decades. Internal sales teams do much more than transcribing orders, and IT departments don’t write their own payroll systems any more.

Instead, everyone types their own documents, sends them via email or messaging systems, uses mobile phones. IT departments buy off-the-shelf solutions for common business tasks - usually operated by SaaS vendors; in many organisations, the IT department is primarily engaged in managing vendors.

And yet…there hasn’t been a huge spike in unemployment. The typists I shared an office with went on to become office managers, sales people, finance administrators. One went off to become a teacher. I caught up with one of my IT colleagues about 15 years later - they’d joined a start-up building a payroll SaaS platform.

My job as a programmer depended heavily on my ability to memorize syntax, and to type it into the computer correctly, first time. There were no “visual editors” - you had to type the instructions in at the terminal perfectly, line by line, or start all over again. Feedback was slow - it could take a good 15 minutes for a SQL query to return results, even on modest data sets. I would usually write out my code in pencil first, maybe show it to a colleague for feedback, go to lunch and re-check it after, and then, slowly, type it in at the minicomputer’s terminal. Building a new data model would take a week or more; making a mistake halfway through would often mean starting all over again. I didn’t have a computer terminal on my desk - it was covered in printouts of code, manuals and specification documents. The office soundscape was dominated by daisywheel typewriters, dot matrix printers, and the squeals of the (then modern) fax machines.

My project was to design and implement a database to award customer rebates. The model was fairly complex - rebates were awarded based on a wide range of criteria, there were exceptions everywhere, and identifying the requirements meant reading the rebate contracts, which was not much fun. The whole process took nine months, and covered only a subset of customers. Building and populating the data schema took months; writing the queries took longer.
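
To give a flavour of the data model, here is a minimal sketch of the sort of schema and key query that project needed - written today with SQLite via Python, with every table and column name invented for illustration. The real system was far messier, with exceptions layered on exceptions.

  # A minimal, hypothetical sketch of the kind of schema the rebate project needed.
  # Table and column names are invented for illustration.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
  CREATE TABLE customer (
      customer_id   INTEGER PRIMARY KEY,
      name          TEXT NOT NULL
  );

  CREATE TABLE rebate_agreement (
      agreement_id  INTEGER PRIMARY KEY,
      customer_id   INTEGER NOT NULL REFERENCES customer(customer_id),
      product_group TEXT NOT NULL,            -- which products the rebate applies to
      rebate_pct    REAL NOT NULL,            -- fraction of qualifying spend returned
      min_spend     REAL NOT NULL DEFAULT 0,  -- threshold before the rebate kicks in
      valid_from    DATE NOT NULL,
      valid_to      DATE NOT NULL
  );

  CREATE TABLE invoice_line (
      line_id       INTEGER PRIMARY KEY,
      customer_id   INTEGER NOT NULL REFERENCES customer(customer_id),
      product_group TEXT NOT NULL,
      invoice_date  DATE NOT NULL,
      net_value     REAL NOT NULL
  );
  """)

  # The key query: qualifying spend per agreement, and the rebate due.
  rebates_due = conn.execute("""
      SELECT a.agreement_id,
             c.name,
             SUM(i.net_value)                AS qualifying_spend,
             SUM(i.net_value) * a.rebate_pct AS rebate_due
      FROM rebate_agreement a
      JOIN customer c     ON c.customer_id = a.customer_id
      JOIN invoice_line i ON i.customer_id = a.customer_id
                         AND i.product_group = a.product_group
                         AND i.invoice_date BETWEEN a.valid_from AND a.valid_to
      GROUP BY a.agreement_id, c.name, a.rebate_pct, a.min_spend
      HAVING SUM(i.net_value) >= a.min_spend
  """).fetchall()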

Even without generative AI, that process today would take days or weeks. The tooling is so much better. I barely need to remember syntax - the IDE just does that for me. I can automate everything so that I can easily go back and change fairly big design decisions without throwing away work. And even with large amounts of data, queries are lightning fast.

And today, as an experiment, I fed what I remember of that project (it was a long time ago!) into ChatGPT and asked it to design a schema and key queries, and it did so in minutes; with a few tweaked prompts, we got to a pretty credible solution in about 30 minutes. And a quick Google search shows that there are several off-the-shelf SaaS rebate management systems.

So the cost of solving a fairly common business problem has gone from “months or years of developer time” to “a small percentage per transaction”. The result is not that we have completely satisfied the demand for rebate systems, and then laid off everyone who worked on those systems - the result is that more and more organisations offer rebates. I now get rebates on train tickets, coffee purchases, holidays etc.

I read an article exploring this concept (can’t find it now!) about how reducing the cost of electric lighting hasn’t reduced our consumption of electricity - it has increased it, because things that previously were not economically (or technically) viable now are: the huge Sphere in Las Vegas, for instance.

The Las Vegas Sphere. Image credit: Cory Doctorow, “The Sphere as Mars, view from my hotel room at Harrah’s, Las Vegas, Nevada, USA”, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=136454844

This huge building features 54,000 m² of external LED displays, which have shown a range of stunning images. The energy consumption of the Sphere, running at full capacity, is estimated to be 28 megawatts - enough to power 21,000 homes. That’s a lot - but without LED technology, it would have been a multiple of that (leaving aside the technical challenges of using incandescent bulbs in a similar way). Someone, somewhere, figured out that it’s economically viable to expend 28 MW in return for the revenue from a sell-out gig.

The same is true for software. As the cost comes down, consumption goes up, and in a business context, software generally reduces costs. That money may disappear into shareholders’ pockets - but more commonly, those reduced costs for a specific capability open up new opportunities, which in turn create demand elsewhere.

Augmentation versus replacement

I’m a sceptic when it comes to AI replacing huge numbers of people. Remember how self-driving cars were “very close”? Even though the various LLMs we see now are super impressive, there are many tasks humans do very easily at which they fail - and the hallucination problem is not going anywhere. When an AI beat a human at Go, many people saw that as a quantum leap in AI capability - but there are relatively simple ways to defeat those AIs.

At a very simplistic level, I think all these limitations come from the fact that the AI systems we build today don’t have an internal model of the world. When you ask an LLM to write a joke about a cat, it doesn’t think about what it knows about felines and their behaviour, and look for things it believes are “funny” - it effectively searches a huge training set, in ways we can’t quite express, for data that matches cat, joke, etc.

Now, in my domain - writing software - there are many tasks that can be learned from a training set. To connect to a database, you can simply find examples of connecting to a database in your training set, without understanding the concept of “database”.
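
The “connect to a database” case really is that mechanical - a few lines of boilerplate that appear, in near-identical form, in millions of repositories. A sketch using Python’s standard sqlite3 module (the table and function names are made up):

  # Boilerplate database access - the kind of code an LLM has seen thousands of
  # times and can reproduce without any notion of what a "database" is.
  # The customer table is hypothetical.
  import sqlite3

  def fetch_customer(db_path: str, customer_id: int):
      conn = sqlite3.connect(db_path)
      try:
          cursor = conn.execute(
              "SELECT customer_id, name FROM customer WHERE customer_id = ?",
              (customer_id,),
          )
          return cursor.fetchone()  # None if the customer doesn't exist
      finally:
          conn.close()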

But to write a complex system that represents entities in the real world, where you’re unlikely to find a close match in your dataset, you do kinda need a model in your mind of those entities, and a process for discovering more about them.

So what?

I think these things are all true:

  • AI tools will improve the productivity of professionals, just like IDEs and virtualization improved the productivity of software developers in the past. This increased productivity may reduce demand for those professionals - but I expect that the reduced cost will increase demand.
  • AI tools may reduce the need for some roles - especially “junior” roles. In my experiments with ChatGPT and Copilot, I found that much of the work that would normally go to a more junior developer can be done by an LLM. “Here’s the outline for a method - fill it out, add error handling and make sure it’s robust”, “Here’s a schema and some sample queries, write a new query to do X”, “Here’s a failing unit test, improve the code so that it passes” (there’s a toy illustration of that last one just after this list).
  • There is an upper limit on how far LLMs can improve at handling real-world cases. The improvement curve is an S-curve, not a J-curve. I believe this for several reasons: the large LLMs have used all the easily accessible training data, and the cost (energy, compute, whatever) of improvements does not scale linearly with the quality of the improvement - adding another dimension increases the cost exponentially. And, philosophically, without a model of the world, LLMs depend on memorization and retrieval, but many real-life situations don’t lend themselves to that model.
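
To illustrate that last kind of prompt, here is the sort of change I mean - a toy example I’ve made up, not a transcript of an actual session:

  # "Here's a failing unit test, improve the code so that it passes."
  # Toy example: the original function paid a rebate even below the threshold.

  def rebate_due(spend: float, rate: float, min_spend: float) -> float:
      # Original (buggy) version:
      #     return spend * rate
      # Fixed version - exactly the kind of change an LLM handles comfortably:
      return spend * rate if spend >= min_spend else 0.0

  def test_no_rebate_below_threshold():
      assert rebate_due(spend=500.0, rate=0.05, min_spend=1000.0) == 0.0

  def test_rebate_above_threshold():
      assert rebate_due(spend=2000.0, rate=0.05, min_spend=1000.0) == 100.0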

Digression: model of the world.

Let’s take, as an example, the challenge of recommending a book to someone.

Before Amazon, there was a website (I’ve forgotten the name now) where you could say which book categories you liked, and which you didn’t, and it would recommend a “secret mix” of books you might like, which looked suspiciously like the best sellers in the categories you liked. It was not particularly useful.

Then came Amazon, with its “People who bought this also bought that” recommender. At the time, it felt like magic - it felt like Amazon had looked at my bookcases and knew what I liked. The way it did that was to find people whose choices were like mine, and show me things they’d bought that I had not. While it felt like magic, it doesn’t work well for new books - because not enough people have bought them yet. It also doesn’t really generalize - my preference in books would not necessarily carry across to my taste in cheese.
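
The mechanism behind “people who bought this also bought that” is simple enough to sketch in a few lines: count co-purchases, then recommend the items that co-occur most often with things you already own. The purchase data below is invented, and real systems add a lot of weighting and normalisation on top, but the core idea is just this:

  # A minimal "people who bought this also bought that" sketch.
  # The purchase histories are invented; real systems weight and normalise heavily.
  from collections import Counter
  from itertools import combinations

  purchases = {
      "alice": {"Dune", "Neuromancer", "Hyperion"},
      "bob":   {"Dune", "Neuromancer", "Snow Crash"},
      "carol": {"Dune", "Hyperion", "Snow Crash"},
  }

  # Count how often each pair of books appears in the same basket.
  co_bought = Counter()
  for basket in purchases.values():
      for a, b in combinations(sorted(basket), 2):
          co_bought[(a, b)] += 1
          co_bought[(b, a)] += 1

  def recommend(owned, top_n=3):
      scores = Counter()
      for mine in owned:
          for (a, b), count in co_bought.items():
              if a == mine and b not in owned:
                  scores[b] += count
      return [book for book, _ in scores.most_common(top_n)]

  print(recommend({"Dune", "Neuromancer"}))  # e.g. ['Hyperion', 'Snow Crash']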

Now, when you ask ChatGPT for recommendations, the results aren’t inspiring. A prompt like “Here are 10 authors I like, can you recommend any books?” gives you back many books by those authors, and then some very obvious best sellers within the same categories. Asking for less obvious recommendations doesn’t really do much. Engaging in conversation - answering questions etc. - still doesn’t seem to create compelling recommendations. For instance, even though I said “I’ve read all of this author’s books”, it still recommended several books by that author…If you then ask ChatGPT to recommend cheese based on your literary tastes, you get a generic list of cheeses.

The way the LLM does this is hard to explain - Stephen Wolfram’s explanation is the most user-friendly one I’ve found. But what matters is that it is not forming an understanding of books, people, tastes, cheese - it’s building a huge library of “what’s the most likely next (part of a) word given this chain of other (parts of) words”. This creates very credible text, and it’s good enough now that it broadly answers the question. But it doesn’t really understand. You can ask it ever more ridiculous questions - “based on my literary tastes, which shampoo would you recommend?” - and it keeps answering. The answers look credible - but I’d be amazed if my taste in books predicts which shampoo I’d like.
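
You can get a feel for “what’s the most likely next word given this chain of words” with a toy version - a bigram model over a tiny made-up corpus. Real LLMs work on sub-word tokens, condition on far longer chains, and use neural networks rather than lookup tables, but the question being answered is the same:

  # A toy "what's the most likely next word?" model: bigram counts over a tiny corpus.
  # Real LLMs use sub-word tokens, much longer contexts and neural networks,
  # but the underlying question has the same shape.
  from collections import Counter, defaultdict

  corpus = "the cat sat on the mat and the cat ate the fish".split()

  next_word = defaultdict(Counter)
  for current, following in zip(corpus, corpus[1:]):
      next_word[current][following] += 1

  def most_likely_next(word):
      candidates = next_word[word]
      return candidates.most_common(1)[0][0] if candidates else "<unknown>"

  print(most_likely_next("the"))  # 'cat' - it follows "the" more often than anything else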

Compare that with how a human would approach this problem - a human creates an internal model of you based on what you tell them, and, somehow, matches that model to books. Or cheese. I don’t think we know quite how we do it - partly based on experience (“my uncle liked Len Deighton novels, and he loved Red Leicester!”), partly based on stereotypes (“Len Deighton is a bit old fashioned, quite blokey, I bet you also like Red Leicester”), partly on half-remembered snippets (“I think there’s a bit in one of the Deighton novels where he mentions Red Leicester - let’s go for that!”). We’re quite likely to say “I’m not sure there’s a reasonable way to go from books to cheese”.

The same is true when we ask the LLM to write software. It does a pretty decent job - and because it’s probably seen lots of examples of similar code, it gives you something that will probably work. And for many applications, there’s prior art - if not for the whole thing, then for subsystems. But I don’t believe that an approach that is basically “find the closest analog to the prompt in my database” will allow you to build brand-new applications.

Much of software development - and knowledge work in general - is not about “writing code”. It’s about navigating the context, exploring and balancing the many tradeoffs, choosing what to optimize and what to downplay. It’s about breaking large, complex problems down into chunks you can reason about. It’s about thinking about edge cases - not just the things that go “right”, but all the ways they can go wrong. And to do that, we humans definitely depend on heuristics - “if it’s an ecommerce application, it probably needs to deal with Black Friday traffic spikes”. But we also depend on our understanding of the problem domain - “if it’s an ecommerce site selling medical equipment, there may be regulatory requirements”, “if it’s an ecommerce site selling high-value items, we should think about fraud prevention”.

So, consider the following requirement:

  design a database schema for an ecommerce site

You can pass this as a prompt to ChatGPT and get a perfectly reasonable response. Give it to a moderately experienced developer, and they’ll ask about your business domain - regulated? High fraud risk? What sort of traffic spikes do you expect? How many products do you have? How many customers are you expecting? Is this a one-off experiment, or the basis for a strategic asset? In their head, they’ll convert your initial requirement into something like:

  Design a database schema for a website that sees occasional traffic spikes, sells low and medium value items, some of which are regulated. The design needs to be extensible.

Sure, if you feed that prompt into ChatGPT, it will do an equally credible job - but right now, I wouldn’t be confident that “credible” is the same as “definitely meets my requirements”.

Get to the point, Nev!

The reason humans can recommend books, or design database schemas, in ways that the LLM can only achieve once you’ve prompted it with all the additional context, is that we have a model of the world that goes beyond “context and prompt”.

I think that any job that depends on manipulating information - and that’s pretty much every job - can be improved by our current generation of AI tools. The jobs where people interact with that information through digital means are the obvious starting point. The productivity enhancements are likely to increase the number of jobs that use digital tools - see, for instance, Chef Robotics (https://www.chefrobotics.ai/).

Super human?

Computers have been superhuman for - well, since the abacus. A computer can do sums faster than a human can. It can store, search and process more information than a human can. It can control motors with more accuracy, sense feedback more granularly, and try more possible solutions than a human can. And over time, the area where computers have become “super human” has grown - autopilots, chess and Go, spotting anomalies on X-rays.

At this point, I think we can say that computers can become superhuman in domains where one or more of the following apply:

  • the rules can be known, and expressed. A rule that can be known and expressed is also known as an algorithm. Algorithms now scale to planet-size data - Amazon recommends books, your phone can guess the next word you want to type, Google can find the most relevant web page for your search term.
  • the rules cannot be known or expressed, but there’s a bounded universe and a set of goals and constraints that can be expressed formally. In chess, the constraints are the rules, and the goal is to win. Even though the number of possible chess games is effectively unbounded (see the Shannon number: https://en.wikipedia.org/wiki/Shannon_number), machine learning algorithms can search the space and find ways to win.
  • the rules cannot be known or expressed, but there’s a lot of training data. While the rules of grammar can be expressed, the rules of “creating meaningful text” cannot, but by feeding huge amounts of text to ChatGPT, it can find “rules” that allow it to generate meaningful text. ChatGPT is superhuman in the sense that it can create meaningful text in ways humans cannot - my abilities top out at about 5 languages; ChatGPT can generate meaningful text in far more languages, and translate credibly between them.

In all those domains, I think the rapid increase in processing power and available data are likely to lead to equally rapid improvements of AI capability.

But…

To become superhuman in more than one domain - chess and language generation? Detecting anomalies in X-rays and recommending books? - is nowhere near on the cards. The first LLMs would make hilarious mistakes when asked maths questions - they are now much better, but that’s because they outsource the maths to a component that is not an LLM.

So, theoretically, you could imagine a huge AI made up of an LLM to do language processing and a bunch of “co processors” to do specific tasks like play chess, or do sums, or recognize things in images. It would almost certainly be a terrible driver, though - because the problems we solve when driving are not really about language, or even expressible as language.

And even without a model of the world or an understanding of cause and effect - just the next most likely token - you can imagine an LLM being trained to ask questions about requests, allowing it to self-refine the prompt that executes the task (I think that would be an interesting experiment!). Zero-shot chain-of-thought prompting (https://arxiv.org/abs/2205.11916) shows how an LLM can be encouraged to “reason”; it wouldn’t be too big a leap to include “ask questions” in the interaction model to improve chain-of-thought.
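
To sketch what I mean (no real API here - call_llm is a hypothetical stand-in for whatever model endpoint you’d actually use):

  # A sketch of zero-shot chain-of-thought plus an "ask questions first" step.
  # call_llm is a hypothetical stand-in for a real model API call.

  def call_llm(prompt: str) -> str:
      raise NotImplementedError("wire this up to your model of choice")

  def answer_with_reasoning(task: str) -> str:
      # Zero-shot chain-of-thought (the paper linked above): appending
      # "Let's think step by step" nudges the model into spelling out its reasoning.
      return call_llm(f"{task}\n\nLet's think step by step.")

  def refine_then_answer(task: str, answer_questions) -> str:
      # Step 1: ask the model what it would need to know before attempting the task.
      questions = call_llm(
          "Before attempting the following task, list the clarifying questions "
          f"you would ask a domain expert:\n\n{task}"
      )
      # Step 2: a human (or another system) answers those questions.
      answers = answer_questions(questions)
      # Step 3: fold the answers back into the prompt and solve step by step.
      return answer_with_reasoning(f"{task}\n\nAdditional context:\n{answers}")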

Nevertheless, I put “reasoning” in scare quotes. I think it’s more accurate (right now, at least) to say that LLMs exhibit behaviour that looks like reasoning. Or rather, it looks like human reasoning. But, fundamentally, it isn’t. And there are a number of limitations that currently don’t have a credible resolution - mostly because of the lack of a model of the world. When an LLM encounters a situation it cannot match to its training data, its behaviour is…unpredictable. And there’s a limit to the context (the amount of information the LLM takes into consideration when deciding what to do next) and the depth and subtlety of reasoning you can get out of an LLM. As I understand it, the cost of scaling both grows much faster than linearly - self-attention cost, for instance, grows quadratically with context length - so it’s not obvious the “LLM as Swiss Army knife of AI” model is sustainable.

Prediction.

I’m in my late 50s. I don’t think my job as a software consultant will be replaced by AI before I retire. But I do expect to use LLMs to become more productive, and I expect to see many roles transformed - software development will be ever less about remembering the magic invocations that the processor understands, and ever more about understanding the problem domain and describing it precisely and consistently.
