So I stirred up a bit of conversation on Twitter last week when I noted that I had already been handed ChatGPT-produced assignments.1 For those who are unaware, ChatGPT is an ‘AI’ chatbot that, given a prompt, can produce texts; it is one of the most sophisticated bots of this sort yet devised, trained on a massive amount of writing (along with substantial human input in the training process, something we’ll come back to). And its appearance has made a lot of waves and caused a fair bit of consternation.
Now I should note at the outset that while I am going to argue that ChatGPT is – or at least ought to be – basically useless for doing college assignments, it is also wrong to use it for this purpose. Functionally all university honor codes prohibit something like ‘unauthorized aid or assistance’ when completing an assignment. Having a chatbot write an assignment – or any part of that assignment – for you pretty clearly meets that definition. Consequently using ChatGPT on a college essay is pretty clearly an impermissible outside aid – that is to say, ‘cheating.’ At most universities, this sort of cheating is an offense that can lead to failing classes or expulsion. So however irritating that paper may be, it is probably not worth getting thrown out of college, money wasted, without a degree. Learn. Don’t cheat.
That said, I want to move through a few of my basic issues: first, what ChatGPT is in contrast to what people seem to think it is. Second, why I think that functionality serves little purpose in essay writing – or more correctly why I think folks who think it ‘solves’ essay writing misunderstand what essay writing is for. Third, why I think that same functionality serves little purpose in my classroom – or more correctly why I think folks who think it solves issues in the classroom fundamentally misunderstand what I am teaching and how.
Now I do want to be clear at the outset that I am not saying that this technology has no viable uses (though I can’t say I’ve yet seen an example of a use I would consider good rather than merely economically viable for ChatGPT in particular) and I am certainly not saying that future machine-learning based products, be they large language models or other products, will not be useful (though I do think that boosters of this technology frequently assume applications in fields they do not understand). Machine learning products are, in fact, already useful and in common use in ways that are good. But I think I will stipulate that much of the boosterism for ChatGPT amounts to what Dan Olson (commenting on cryptocurrency) describes as, “technofetishistic egotism,” a condition in which tech creators fall into the trap where, “They don’t understand anything about the ecosystems they’re trying to disrupt…and assume that because they understand one very complicated thing, [difficult programming challenges]…that all other complicated things must be lesser in complexity and naturally lower in the hierarchy of reality, nails easily driven by the hammer that they have created.”
Of course that goes both ways which is why I am not going to say what capabilities machine learning may bring tomorrow. It is evidently a potentially powerful technology and I am not able to assess what it may be able to do in the future. But I can assess the observed capabilities of ChatGPT right now and talk about the implications those capabilities have in a classroom environment, which I do understand.2 That means – and I should be clear on this – this is a post about the capabilities of ChatGPT in its current form; not some other machine learning tool or AI that one imagines might exist in the future. And in that context what I see does not convince me that this technology is going to improve the learning experience; where it is disruptive it seems almost entirely negatively so and even then the disruption is less profound than one might think.
Now because I am not a chatbot but instead a living, breathing human who in theory needs to eat to survive, I should remind you that if you like what you are reading here you can help by sharing what I write (for I rely on word of mouth for my audience) and by supporting me on Patreon. And if you want updates whenever a new post appears, you can click below for email updates or follow me on twitter (@BretDevereaux) for updates as to new posts as well as my occasional ancient history, foreign policy or military history musings, assuming there is still a Twitter by the time this post goes live.
The Heck is a ChatGPT?
But I think we want to start by discussing what ChatGPT is and what it is not; it is the latter actually that is most important for this discussion. The tricky part is that ChatGPT and chatbots like it are designed to make use of a very influential human cognitive bias that we all have: the tendency to view things which are not people as people or at least as being like people. We all do this; we imagine our pets understand more than they can, have emotions more similar to ours than they do,3 or that inanimate objects are not merely animate but human in their feelings, memories and so on. We even imagine that the waves and winds are like people too and assign them attributes as divine beings with human-like emotions and often human-like appearances. We beg and plead with the impersonal forces of the world like we would with people who might be moved by those emotions.
The way ChatGPT and other chatbots abuse that tendency is that they pretend to be like minds – like human minds. But it is only pretend; there is no mind there, and that is the key to understanding what ChatGPT is (and thus what it is capable of). Now I can’t claim to understand the complex computer science that produced this program (indeed, with machine learning programs, even the creators sometimes cannot truly understand ‘how’ the program comes to a specific result), but enough concerning how it functions has been discussed to get a sense of what it can and cannot do. Moreover, its limitations (demonstrated in its use and thus available for interrogation by the non-specialist) are illustrative of its capabilities.
ChatGPT is a chatbot (a program designed to mimic human conversation) that uses a large language model (a giant model of the probabilities of what words will appear and in what order). That large language model was produced from a giant text base (some 570GB, reportedly), though I can’t find that OpenAI has been transparent about what was and was not in that training base (though no part of that training data is post-2021, apparently). The program was then trained by human trainers who either gave the model a prompt along with an appropriate output to that prompt (supervised fine-tuning) or had the model generate several responses to a prompt which humans then sorted from best to worst (the reward model). At each stage the model is refined (CGP Grey has a very accessible description of how this works) to produce results more in keeping with what the human trainers expect or desire. This last step is really important to remember whenever anyone suggests that it would be trivial to train ChatGPT on a large new dataset; a lot of human intervention was in fact required to get these results.
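To make those two training stages concrete, here is a deliberately tiny sketch in Python; the prompts, names and data are entirely my own invention for illustration, not OpenAI’s actual pipeline:

# Stage 1: supervised fine-tuning -- human trainers write the desired output for a given prompt.
sft_examples = [
    {"prompt": "When did World War I start?",
     "demonstration": "World War I began in 1914."},
]

# Stage 2: the reward model -- the model produces several candidate answers
# and human raters rank them from best to worst.
candidates = ["It began in 1914.", "It began in 1915.", "Wars have many causes."]
human_ranking = [0, 2, 1]  # candidate indices, best first, supplied by a human rater

def scores_from_ranking(ranking):
    """Turn a human ranking into the scores a reward model would be trained to predict."""
    return {index: len(ranking) - position for position, index in enumerate(ranking)}

print(scores_from_ranking(human_ranking))  # {0: 3, 2: 2, 1: 1}

The point of the sketch is simply that both stages consume human labor – someone has to write the demonstrations and someone has to do the ranking – which is why retraining the system on a large new dataset is not a trivial, fully automatic exercise.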
It is crucial to note, however, what data is being collected and refined in this training process: it is purely information about how words appear in relation to each other. That is, how often words occur together, how closely, in what relative positions and so on. It is not, as we do, storing definitions or associations between those words and their real-world referents, nor is it storing a perfect copy of the training material for future reference. ChatGPT does not sit atop a great library it can peer through at will; it has read every book in the library once, distilled the statistical relationships between the words in that library and then burned the library.
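To illustrate that ‘distill the statistics, burn the library’ point, here is a toy sketch in Python – a simple word-pair counter of my own devising, vastly cruder than a real large language model but capturing the relevant limitation:

from collections import Counter, defaultdict

library = "world war one started in 1914 . the war started in europe ."

# 'Read the library once': record which word follows which, and how often.
follow_counts = defaultdict(Counter)
words = library.split()
for previous, following in zip(words, words[1:]):
    follow_counts[previous][following] += 1

# 'Burn the library': from here on, only the word statistics survive.
del library, words

def most_likely_next(word):
    """Return the statistically most common follower of a word -- co-occurrence, not knowledge."""
    followers = follow_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(most_likely_next("started"))  # 'in' -- looks like the start of an answer, but is only a word statistic

Nothing in that table knows what a war or a year is; it only knows which words tend to follow which other words.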
ChatGPT does not understand the logical correlations of these words or the actual things that the words (as symbols) signify (their ‘referents’). It does not know that water makes you wet, only that ‘water’ and ‘wet’ tend to appear together and humans sometimes say ‘water makes you wet’ (in that order) for reasons it does not and cannot understand.
In that sense, ChatGPT’s greatest limitation is that it doesn’t know anything about anything; it isn’t storing definitions of words or a sense of their meanings or connections to real world objects or facts to reference about them. ChatGPT is, in fact, incapable of knowing anything at all. The assumption so many people make is that when they ask ChatGPT a question, it ‘researches’ the answer the way we would, perhaps by checking Wikipedia for the relevant information. But ChatGPT doesn’t have ‘information’ in this sense; it has no discrete facts. To put it one way, ChatGPT does not and cannot know that “World War I started in 1914.” What it does know is that “World War I” “1914” and “start” (and its synonyms) tend to appear together in its training material, so when you ask, “when did WWI start?” it can give that answer. But it can also give absolutely nonsensical or blatantly wrong answers with exactly the same kind of confidence because the language model has no space for knowledge as we understand it; it merely has a model of the statistical relationships between how words appear in its training material.
In artificial intelligence studies, this habit of manufacturing false information gets called an “artificial hallucination,” but I’ll be frank: I think this sort of terminology begs the question.4 ChatGPT gets called an artificial intelligence by some boosters (the company that makes it has the somewhat unearned name of ‘OpenAI’) but it is not some sort of synthetic mind so much as it is an extremely sophisticated form of the software on your phone that tries to guess what you will type next. And ChatGPT isn’t suffering some form of hallucination – which is a distortion of sense-perception. Even if we were to say that it can sense-perceive at all (and this is also question-begging), its sense-perception has worked just fine: it has absorbed its training materials with perfect accuracy, after all; it merely lacks the capacity to understand or verify those materials. ChatGPT isn’t a mind suffering a disorder but a program functioning perfectly as it returns an undesired output. When ChatGPT invents the title and author of a book that does not exist because you asked it to cite something, the program has not failed: it has done exactly what was asked of it, putting words together in a statistically probable relationship based on your prompt. But calling this a hallucination is already ascribing mind-like qualities to something that is not a mind or even particularly mind-like in its function.
Now I should note the counter-argument here is that by associating words together ChatGPT can ‘know’ things in some sense because it can link those associations. But there are some major differences here. First, human minds assess the reliability of those associations: how often, when asked a question, does an answer pop into your mind that you quickly realize cannot be right, or do you realize you don’t know the answer at all and must look it up? Part of that process, of course, is that the mental associations we make are ‘checked’ against the real world realities they describe. In fancy terms, words are merely symbols of actual real things (their ‘referents’ – the things to which they refer) and so the truth value of words may be checked against the actual status of their referents. For most people, this connection is very strong. Chances are, if I say ‘wool blanket’ your mind is going to not merely play word association but also conjure up some memories of actual wool blankets – their sight, touch or smell. ChatGPT lacks this capability; all it has are the statistical relationships between words, stripped entirely of their referents. It will thus invent descriptions for scientific phenomena that aren’t real, embellish descriptions of books that do not exist and, if asked to cite things, it will invent works to cite, because none of those things is any more or less real to ChatGPT than actual real existing things.
All it knows – all it can know – are the statistical relationships of how words appear together, refined by the responses that its human trainers prefer. Thus the statement that ChatGPT doesn’t know anything about anything or, more correctly, that it cannot know anything about the topics it is asked to write about.
All of that is important to understanding what ChatGPT is doing when you tell it to, say, write an essay. It is not considering the topic, looking up references, thinking up the best answer and then mobilizing evidence for that answer. Instead it is taking a great big pile of words, picking out the words which are most likely to be related to the prompt and putting those words together in the order-relationships (but not necessarily the logical relationships) that they most often have, modified by the training process it has gone through to produce ‘better’ results. As one technical writer, Ted Chiang, has put it, the result is merely a ‘very lossy’ (that is, not very faithful) reproduction of its training materials, rather than anything new or based on any actual understanding of the underlying objects or ideas. But, because it is a chatbot, it can dole those words out in tremendous quantity, with flawless spelling and grammar, following whatever formula (more or less) the prompt asks for. But it doesn’t know what those words mean; indeed, coming from the chatbot, in a sense they mean nothing.
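Continuing the toy sketch from above (and again, this is my own illustration, not the actual and far more sophisticated machinery inside ChatGPT), the generation step amounts to repeatedly asking ‘what word plausibly comes next?’ and chaining the answers together:

import random
from collections import Counter, defaultdict

training_text = ("the roman army was disciplined . the roman army was large . "
                 "the army won because the army was disciplined .")

follow_counts = defaultdict(Counter)
tokens = training_text.split()
for previous, following in zip(tokens, tokens[1:]):
    follow_counts[previous][following] += 1

def generate(start_word, length=12):
    """Chain together statistically probable next words: fluent-looking, but meaning-free."""
    output = [start_word]
    for _ in range(length):
        followers = follow_counts.get(output[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        output.append(random.choices(choices, weights=weights)[0])
    return " ".join(output)

print(generate("the"))  # e.g. 'the army was large . the roman army was disciplined . the'

Nothing in that loop checks whether the resulting sentence is true or even coherent; it only checks that each word is a statistically plausible successor to the one before it, which is the limitation everything that follows turns on.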
I stress this functionality at the beginning because I want readers to understand that many of the mental processes – analysis, verification, logical organization – that we take for granted from a thinking person are things ChatGPT does not do and is entirely incapable of in the same way that an electric can-opener cannot also double as a cell phone. Those capabilities are both entirely outside of the structure of the current iteration of ChatGPT and also entirely outside of the processes that the training procedures which produced ChatGPT will train. Incremental improvements in the can-opener will not turn it into a cell phone either; the cell phone is an entirely different sort of machine. Thus the confidence among some that the ‘hallucination’ problem will be inevitably solved seems premature to me. It may well be solved, but it may well not; doing so will probably require the creation of an entirely new sort of machine of a type never before created. That eventuality cannot be taken for granted; it is not even something that we know is possible (though it may well be!). It most certainly will not happen on its own.
The Heck Is an Essay?
So that is what ChatGPT does: in response to a prompt, it puts together an answer that is composed of words in its training material organized based on the statistical probability that those words appear together and the degree to which they are related to the prompt (processed through an extremely complex language model). It thus assembles words from its big bag of words in a way that looks like the assemblages of words it has seen in its training and which its human trainers have ranked highly. And if all you want ChatGPT to do is precisely that: somewhat randomly assemble a bunch of words loosely related to a topic in a form that resembles communication, it can do that for you. I’m not sure why you want it to do that, but that is the one and only thing it can do.
But can ChatGPT write an essay?
It has been suggested that this endangers or even makes obsolete the essay or particularly the ‘college essay,’ and I think this misunderstands what the purpose of an essay is. Now the definition of an essay is somewhat nebulous, especially when it comes to length; essays are shorter than books but longer than notes, though these too are nebulously defined. Still we can have a useful definition:
An essay is a piece of relatively short writing designed to express an argument – that is, it asserts a truth about something real outside of the essay itself – by communicating the idea of the argument itself (the thesis) and assembling evidence chosen to prove that argument to a reader. Communication is thus part of writing an essay, but not the only part or even necessarily the most important. Indeed, the communication element may come in entirely different forms from the traditional essay. Consider video essays or photo essays: both have radically changed the form of communication but they remain essays because the important part – the argument asserting a truth about something supported by assembled evidence – remains the same, even as the nature of the evidence and communication has changed.
Writing an essay thus involves a number of steps, of which communication is merely the last. Ideally, the essay writer has first observed their subject, then drawn some sort of analytical conclusion about that subject,5 then organized their evidence in a way that expresses the logical connections between various pieces of evidence, before finally communicating that to a reader in a way that is clear and persuasive.
ChatGPT is entirely incapable of the first two steps (though it may appear to do either of them) and incompetent at the third; its capabilities lie entirely in the last step (and even there they are generally inferior to a well-trained human writer at present).
When it comes to observing a subject, as noted ChatGPT is not capable of research so the best it can do, to borrow Ted Chiang’s phrasing again, is provide a ‘lossy’ replica of the research of others and only if that research has somehow found its way into ChatGPT’s training materials. Even when the necessary information is contained within the works in ChatGPT’s training material, it can’t actually understand those things, it can only reproduce them, so if they do not explicitly draw the conclusion it needs in as many words, ChatGPT can’t do so either. We can demonstrate this by asking ChatGPT an almost trivially easy research question, like, “What is the relationship between Edward Luttwak’s Grand Strategy of the Roman Empire and Benjamin Isaac’s The Limits of Empire?” And so we did:
If you know nothing about either book, this answer almost sounds useful (it isn’t).6 Now this is a trivial research task; simply typing ‘the limits of empire review’ into Google and then clicking on the very first non-paywalled result (this review of the book by David Potter from 1990) and reading the first paragraph makes it almost immediately clear that the correct answer is that Isaac’s book is an intentional and explicit rebuttal of Luttwak’s book, or as Potter puts it, “Ben Isaac’s The Limits of Empire offers a new and formidable challenge to Luttwack.” A human being who understands the words and what they mean could immediately answer the question, but ChatGPT, which doesn’t, cannot: it can only BS around the answer by describing both books and then lamely saying they “intersect in some ways.” The information ChatGPT needed was clearly in its training materials (or it wouldn’t have a description of either book to make a lossy copy of),7 but it lacks the capacity to understand that information as information (rather than as a statistically correlated sequence of words).8 Consequently it cannot draw the right conclusion and so talks around the question in a convincing but erroneous way.
Note that no analysis was required for the above question! It was a pure reading comprehension question that could be solved by merely recognizing that something in the training set already said the answer and copying it, but ChatGPT wasn’t even capable of that because while it has a big bag of words related to both books, it lacks the capability to understand and grab the relevant words. This is an example of the not at all uncommon situation where Google is a far better research tool than ChatGPT, because Google can rely on your reading comprehension to understand the places it points you to which may have the answer you seek.
So research and observation are out; what about analysis? Well, if you have been following along you’ll realize that ChatGPT is incapable of doing that too. What it can do is find something that looks like analysis (though it may not be analysis or it may be quite bad analysis) and then reproduce it (in a lossy form) for you. But the point of analysis is to be able to provide novel insight, that is to either suggest a conclusion hitherto unconsidered for a given problem or equally importantly to come up with a conclusion for a problem that is only being encountered for the very first time. ChatGPT, limited entirely to remixing existing writings, cannot do either.
As a system to produce essays, this makes ChatGPT not very useful at all. Generally when people want an essay, they don’t actually want the essay; the essay they are reading is instead a container for what they actually want which is the analysis and evidence. An essay in this sense is a word-box that we put thoughts in so that we can give those thoughts to someone else. But ChatGPT cannot have original thoughts, it can only remix writing that is already in its training material; it can only poorly copy writing someone else has already done better somewhere.9 ChatGPT in this sense is like a friendly, if somewhat daft neighbor who noticed one day that every so often you get a box from Amazon and that you seem quite happy to get it and so decides to do you a favor by regularly ordering empty Amazon boxes to your house. The poor fellow does not know and cannot understand that it was the thing in the box – in this case, the thoughts (original observations, analysis, evidence) in the essay – that you actually wanted. ChatGPT doesn’t have any thoughts to give you (though it can somewhat garble someone else’s thoughts), but it sure can order you up a bunch of very OK boxes.
In a very real sense then, ChatGPT cannot write an essay. It can imitate an essay, but because it is incapable of the tasks which give an essay its actual use value (original thought and analysis), it can only produce inferior copies of other writing. That quite a few people, including some journalists, have supposed that ChatGPT can write an essay suggests to me that they have an impoverished idea of what an essay is, viewing it only as ‘content’ rather than as a box that thoughts go into for delivery, or haven’t really scrutinized what ChatGPT outputs closely enough.
Now there are, in that previous analogy, box-sellers online: outlets who really do not care about the thoughts in the essay but merely want units of text to throw up to generate clicks. Few reputable publications function this way – that’s why they have editors whose job is to try to figure out if your essay has a thought in it actually worth sharing and then if so to help guide you to the most effective presentation of that thought (that’s the editing process). But there are a lot of content mills online which are really looking to just supply large amounts of vaguely relevant text at the lowest possible cost hoping to harvest views from gullible search engines. For those content mills, ChatGPT potentially has a lot of value but those content mills provide almost no value to us, the consumer. Far from it, they are one of the major reasons why folks report declining search engine quality, as they crowd out actually useful content.10
That said I don’t want to rule out ChatGPT’s ability to produce functional formulaic documents entirely. I’ve heard it suggested that it could massively reduce the cost of producing formula-driven legal and corporate documents and perhaps it can. It’s also been suggested it could be trained to write code, though my understanding is that as of now, most of the code it produces looks good but does not work well. I don’t write those sorts of things, though, so I can’t speak to the question. I would be concerned, though, because ChatGPT can make some very bad mistakes and has no way of catching those mistakes, so using it for very high-stakes legal or corporate documents seems risky. ChatGPT can’t write a good essay, but a bad essay only wastes a few minutes of your time; a bad contract can cost a company millions and a single bad line of code can crash an entire program (or just cause it to fail to compile and in either case waste hours and hours of bug-hunting to determine what went wrong).
But the core work of the essay? This ChatGPT cannot do. And importantly, this is not a capability that merely requires iterative improvements to the product. While ChatGPT can fake an original essay, the jump from faking that essay to writing an actually original thought certainly looks like it would require a completely different program, one capable of observing the real world, analyzing facts about it and then reaching conclusions.
The Heck is the Teaching Essay For?
That leaves the role of ChatGPT in the classroom. And here some of the previous objections do indeed break down. A classroom essay, after all, isn’t meant to be original; the instructor is often assigning an entire class to write essays on the same topic, producing a kaleidoscope of quite similar essays using similar sources. Moreover classroom essays are far more likely to be about the kind of ‘Wikipedia-famous’ people and works which have enough of a presence in ChatGPT’s training materials for the program to be able to cobble together a workable response (by quietly taking a bunch of other such essays, putting them into the blender and handing out the result, a process which in the absence of citation we probably ought to understand as plagiarism). In short, many students are often asked to write an essay that many hundreds of students have already written before them. And so there were quite a few pronouncements that ChatGPT had ‘killed’ the college essay. And indeed, in my own experience in the Twitter discourse around the system, one frequent line of argument was that ChatGPT was going to disrupt my classroom, so shouldn’t I just go ahead and get on board with the new technology?
This both misunderstands what the college essay is for as well as the role of disruption in the classroom. Let’s start with the first question: what is the teaching essay (at any level of schooling) for? It’s an important question and one that arises out of a consistent problem in how we teach students, which is that we rarely explain our pedagogy (our ‘teaching strategy’) to the students. That tends to leave many assignments feeling arbitrary even when teachers have in fact put a great deal of thought into why they are assigning what they are and what skills they are supposed to train. So let’s talk about why we assign essays, what those assignments are supposed to accomplish and why ChatGPT has little to offer in that realm.
In practice there are three things that I am aiming for an essay assignment to accomplish in a classroom. The first and probably least important is to get students to think about a specific historical topic or idea, since they (in theory) must do this in order to write about it. In my own planning I sometimes refer to these assignments as ‘pedagogical’ essays (not a perfect term) where the assignment – typically a ‘potted’ essay (short essay with pre-chosen sources handed to students, opposite of a ‘research’ essay) – is meant to have students ponder a specific question for the value of that question. One example is an essay prompt I sometimes use in my ancient history survey asking students, “On what basis do we consider Alexander to be ‘great’? Is this a sound basis to apply this title?” Obviously I want students here to both understand something about Alexander but also to think about the idea of greatness and what that means; does successfully killing a lot of people and then failing to administer what remains qualify as greatness and if so what does that say about what we value? Writing the essay forces them to ponder the question. That value is obviously lost if they just let ChatGPT copy some other essay for them.
That said this first sort of goal is often the least important. While of course I think my course material matters, the fact is few students will ever need to recall the details of Alexander the Great from memory at some point in their lives. They’ll be able to look him up and hopefully, with the broad knowledge framework I’ve given them and the research and analysis skills, be able to reach these same conclusions. Which brings us to:
The second goal, middle in importance, is training the student in how to write essays. I’ve made this element of my approach more explicit in recent years, making the assignments more closely resemble the real world writing forms they train for. Thus the classic 3-5 page paper becomes the c. 1000-word think-piece (though I do require a bit more citation than a print publication would in a ‘show your work’ sort of way), the short paper becomes a 700-800 word op-ed, etc. The idea here is to signal to students more clearly that they are training to write real things that exist in the world outside of the classroom. That said, while a lot of students can imagine situations in which they might want to write an op-ed or a think piece or a short speech, many of them won’t ever write another formal essay after leaving college.
Thus the last and most important thing I am trying to train is not the form of the essay nor its content, but the basic skills of having a thought and putting it in a box that we outlined earlier. Even if your job or hobbies do not involve formal writing, chances are (especially if your job requires a college degree) you are still expected to observe something real, make conclusions about it and then present those conclusions to someone else (boss, subordinates, co-workers, customers, etc.) in a clear way, supported by convincing evidence if challenged. What we are practicing then is how to have good thoughts, put them in good boxes and then effectively hand that box to someone else. That can be done in a formal written form (the essay), in informal writing (emails, memos, notes, Slack conversations), or verbally (speeches, but also arguments, debates and discussions). The skills of having the idea, supporting it with evidence, organizing that evidence effectively to be understood and then communicating that effectively are transferable and the most important skills that are being practiced when a student writes an essay.
Crucially – and somehow this point seems to be missed by many of ChatGPT’s boosters I encountered on social media – at no point in this process do I actually want the essays. Yes, they have to be turned in to me and graded and commented on, because that feedback in turn is meant both to motivate students to improve and to signal where they need to improve.11 But I did not assign the project because I wanted the essays. To indulge in an analogy, I am not asking my students to forge some nails because I want a whole bunch of nails – the nails they forge on early attempts will be quite bad anyway. I am asking them to forge nails so that they learn how to forge nails (which is why I inspect the nails and explain their defects each time) and by extension also learn how to forge other things that are akin to nails. I want students to learn how to analyze, organize ideas and communicate those ideas.
What one can immediately see is that a student who simply uses ChatGPT to write their essay for them has simply cheated themselves out of the opportunity to learn (and also wasted my time in providing comments and grades). As we’ve seen above, ChatGPT cannot effectively replace the actual core tasks we are training for, so this is not a case where the existence of spinning jennies renders most training at hand spinning obsolete. And it certainly doesn’t fulfill the purpose of the assignment.
To which some boosters of the technology respond that what I should really be doing is training students on how to most effectively use ChatGPT as a tool. But it is not clear to me that ChatGPT functions well as a tool for any part of this process. One suggestion is to write an outline and then feed that into ChatGPT to generate a paper, but that fails to train the essential communication component of the assignment and in any case, ChatGPT is actually pretty bad at the nuts and bolts of writing paragraphs. Its tendency in particular to invent facts or invent non-existent sources to cite makes it an enormous liability here; it is a very bad research tool because it is unreliable. Alternately the suggestion is that students could use ChatGPT to produce an essay they edit to fit or an outline they fill in; both approaches run into the problem that the student is now trying to offload the most important part of the task for them to learn: the actual thinking and analysis. And the crucial thing to note is that the skill that is not being trained in both cases is a skill that current large language models like ChatGPT cannot perform or perform very poorly.12
I suspect this argument looks plausible to people because they are not thinking in terms of being trained to think about novel problems, but in terms of the assignment itself; they are thinking about the most efficient way to produce ‘one unit of essay.’ But what we’re actually doing is practicing a non-novel problem (by treating it as a novel problem for the purpose of the assignment), so that when we run into novel problems, we’ll be able to apply the same skills. Consequently they imagine that ChatGPT, trained as it is on what seems to be an awful lot of mediocre student essays (it mimics the form of a bad student essay with remarkable accuracy), can perform the actual final task in question, but it cannot.
Conclusion: Preparing to Be ‘Disrupted.’
The reply that all of this gets has generally been some combination of how this technology is ‘the future,’ that it will make essay writing obsolete so I should focus on training for it,13 and most of all that the technology will soon be so good, if it is not already, that any competent student will be able to use it to perfectly fake good papers. Thus, I am told, my classroom is doomed to be ‘disrupted’ by this technology so I should preemptively surrender and get on board.
And no. No, I don’t think so.
I do think there are classrooms that will be disrupted by ChatGPT, but those are classrooms where something is already broken. Certainly for a history classroom, if ChatGPT can churn out a decent essay for your assignment, chances are the assignment is poorly designed. ChatGPT after all cannot analyze a primary source (unless it has already been analyzed many times in its training materials), it struggles to cite scholarship (more often inventing fake sources) and it generally avoids specific evidence. Well-designed assignments which demand proper citation, specific evidence to support claims (rather than general statements) and a clear thesis are going to be beyond ChatGPT and indeed require so much editing to produce from a ChatGPT framework as to make it hardly worth the effort to cheat. If your essay prompt can be successfully answered using nothing but vague ChatGPT generated platitudes, it is a bad prompt.14
Meanwhile, ChatGPT responses seem to be actually pretty easy to spot once you know how to look for the limitations built into the system. There are already programs designed to detect if a piece of writing is machine-written; they’re not fully reliable yet but I suspect they will become more reliable over time mostly because it is in the interests of both AI-developers (who do not want their models trained on non-human produced writing) and search engines (who want to be able to exclude from search results the veritable river of machine-produced content-mill garbage we all know is coming) to develop that capability. But because of the ways ChatGPT is limited, a human grader should also be able to flag ChatGPT generated responses very quickly too.
It should be trivially easy, for instance, for a grader to confirm if the sources a paper cites exist.15 A paper with a bunch of convincing-sounding but entirely invented sources is probably machine-written because humans don’t tend to make that mistake. If instead, as is its wont, the paper refers merely vaguely to works written by a given author or on a given topic, insist the student produce those works (and require citation on all papers) – this will be very hard for the student with the ChatGPT paper as those works will not, in fact, exist.16 ChatGPT also has a habit of mistaking non-famous people for famous people with similar names; again for a grader familiar with the material this should be quite obvious.
And then of course there are the errors. ChatGPT makes a lot of factual mistakes, especially as it gets into more technical questions where the amount of material for it to be trained on is smaller. While the text it produces often looks authoritative to someone with minimal knowledge in that field, in theory the person grading the paper should have enough grounding to spot some of the obvious howlers that are bound to sneak in over the course of a longer research paper.17 By way of example, I asked ChatGPT to write on, “the causes of Roman military success in the third and second centuries BCE.” Hardly a niche topic.18 The whole thing was sufficiently full of problems and errors that I’m just going to include an annotated Word document pointing them all out here:
Needless to say, this would not be a passing (C or higher) paper in my class. Exact counting here will vary but I identified 38 factual claims, of which 7 were correct, 7 were badly distorted and 24 were simply wrong. A trainwreck this bad would absolutely have me meeting with a student and raising questions which – if the paper was machine-written – might be very hard for the student to answer. Indeed, a research paper with just three or four of these errors would probably prompt a meeting with a student to talk about their research methods. This is certainly, then, an error rate which is going to draw my attention and cause me to ask questions about who exactly wrote the essay and how.19
And that’s the thing: in a free market, a competitor cannot simply exclude a disruptive new technology. But in a classroom, we can absolutely do this thing. I am one of those professors who doesn’t allow laptops for note-taking (unless it is a disability accommodation, of course) because there’s quite a bit of evidence that laptops as note-taking devices lower student performance (quite apart from their potential to distract) and my goal is to maximize learning. This isn’t me being a luddite; I would ban, say, classroom firecrackers or a live jazz band for the same reason and if laptops improved learning outcomes somehow (again, the research suggests they don’t), I’d immediately permit them. Given that detecting machine-writing isn’t particularly hard and that designing assignments that focus on the skills humans can learn that the machines cannot (and struggle to fake) is good pedagogical practice anyway, excluding the technology from my classroom is not only possible, it is indeed necessary.
Now will this disrupt some classrooms? Yes. Overworked or indifferent graders will probably be fooled by these papers or, more correctly, they will not care who wrote the paper because those instructors or graders are either not very much invested in learning outcomes or not given the time and resources to invest however much they might wish to. I think schools are going to need to think particularly about the workload on adjuncts and TAs who are sometimes asked to grade through absurdly high numbers of papers in relatively little time and thus will simply lack the time to read carefully enough. Of course given how much students are paying for this, one would assume that resources could be made available to allow for the bare minimum of scrutiny these assignments deserve. Schools may also need to rethink the tradeoffs of hiring indifferent teachers ‘for their research’ or for the prestige of their PhD institutions because the gap between good, dedicated teachers and bad, indifferent ones is going to grow wider as a result of this technology.
Likewise, poorly designed assignments will be easier for students to cheat on, but that simply calls on all of us to be more careful and intentional with our assignment design (though in practice in my experience most professors, at least in history and classics, generally are). I will confess every time I see a news story about how ChatGPT supposedly passed this or that exam, I find myself more than a little baffled and quite concerned about the level of work being expected in those programs. If ChatGPT can pass business school, that might say something rather concerning about business school (or at least the bar they set for passing).
The final argument I hear is that while ChatGPT or large language models like it may not make my job obsolete now, they will inevitably do so in the future, that these programs are inevitably going to improve to the point where all of the limitations I’ve outlined will be surpassed. And I’ll admit some of that is possible but I do not think it is by any means certain. Of the processes we’ve laid out here, observing, analyzing those observations, arranging evidence to support conclusions and then communicating all of that, ChatGPT only does (or pretends to do) the last task. As I noted above, an entirely new machine would be necessary for these other processes and it is not certain that such a machine is possible within the limits of the computing power now available to us. I rather suspect it is, but it doesn’t seem certain that it is.
More broadly, as far as I can tell it seems that a lot of AI research (I actually dislike a lot of these terms which seem to me to imply that what we’ve achieved is a lot closer to a synthetic mind than it really is, at least for now) has proceeded on a ‘fake it till you make it’ model. It makes sense as a strategy: we want to produce a mind, but we don’t really know how a mind works at full complexity, so we’ve chosen instead to try to create machines which can convincingly fake being a mind in the hopes that a maximally convincing fake will turn out to be a mind of some sort. I have no trouble imagining that strategy could work, but what I think AI-boosters need to consider is that it also may not. It may in fact turn out that the sort of machine learning we are doing is a dead end.
It wouldn’t be the first time! Early alchemists spent a lot of time trying to transmute lead into gold; they ended up pioneering a lot of chemistry, exploring chemical reactions to try to achieve that result. Important things were learned, but you know what no amount of alchemical proto-chemistry was ever going to do? Turn lead into gold. As a means of making gold those experiments were dead ends; if you want to turn lead into gold you have to figure out some way of ripping three protons off of a lead atom, which purely chemical reactions cannot do. The alchemist who devised chemical reactions aimed at producing progressively more convincing fakes of gold, until he at last managed the perfect fake that would be the real thing, was bound to fail because that final step turns out to be impossible. The problem was that the alchemist had to experiment without knowing what made some things (compounds) different from other things (elements) and so couldn’t know that while compounds could be altered in chemical reactions, elements could not.
In short, just as the alchemist labored without really knowing what gold was or how it worked, but was only able to observe its outward qualities, so too our AI engineers are forced to work without really knowing what a mind is or how it works. This present research may turn out to be the way that we end up learning what a mind really is and how it really works, or it may be a dead end. We may never turn ChatGPT into gold. It may be impossible to do so. Hopefully even if that is the case, we’ll have developed some useful tools along the way, just like those alchemists pioneered much of chemistry in the pursuit of things chemistry was incapable of doing.
In the meantime, I am asking our tech pioneers to please be more alive to the consequences of the machines you create. Just because something can be done doesn’t mean it should be done. We could decide to empirically test if 2,000 nuclear detonations will actually produce a nuclear winter,20 but we shouldn’t. Some inventions – say, sarin gas – shouldn’t be used. Discovering what we can do is always laudable; doing it is not always so. And yet again and again these new machines are created and deployed with vanishingly little concern about what their impacts might be. Will ChatGPT improve society, or just clutter the internet with more junk that will take real humans more time to sort through? Is this a tool for learning or just a tool to disrupt the market in cheating?
Too often the response to these questions is, “well if it can be done, someone will do it, so I might as well do it first (and become famous or rich),” which is both an immorally self-serving justification and a suicidal rule of conduct to adopt for a species which has the capacity to fatally irradiate its only biosphere. The amount of power our species has to create and destroy long ago exceeded the point where we could survive on that basis.
And that problem – that we need to think hard about the ethics of our inventions before we let them escape our labs – that is a thinking problem and thus one in which ChatGPT is entirely powerless to help us.
- And I should be clear right here ahead of time that nothing that follows is particular to any paper(s) I may have received. Do not ask “what happened to the student(s)?” or “how did you know?” or “what class was this in?” because I can’t tell you. Student privacy laws in the United States protect that sort of information and it is a good thing they do. The observations that follow are not based on student papers, instead they are based on a number of responses I had ChatGPT produce for me to get a sense of what such an effort at cheating might look like and how I might detect it.
- After all I may not have experience as a creator of large language models, but I am a fully qualified end user. I cannot and indeed will not critique how ChatGPT was created, but I am perfectly qualified to say, “this product as delivered does not meet any of my needs.”
- Not that pets don’t have emotions or some kind of understanding, but we anthropomorphize our pets a lot as a way of relating to them.
- Since I am going to use this phrase a lot I should be clear on its meaning. To ‘beg the question’ is not to ask someone to ask you something, but rather to ask your interlocutor in a debate or discussion to concede as a first step the very thesis you wanted to prove. If we were, say, debating the value of Jane Austen’s writing and I led by saying, “well, you must first concede she writes extremely well!” that would be question begging. It’s more common to see actual question begging occur as a definitional exercise; an attorney who defines the defendant at a trial as a ‘criminal’ has begged the question, assuming the guilt of the person whose guilt has not yet been judged in the proceeding where that is the primary concern.
- In our previous definition this conclusion is an argument, but we could easily expand our definition to also include descriptive essays (which aim not to make a new conclusion about something but merely to assemble a collection of generally accepted facts). There is still an analytical process here because the writer must determine what facts to trust, which are important enough to include and how they ought to be arranged, even though no explicit argument is being made. Indeed, such a descriptive essay (like a Wikipedia article) makes an implicit argument based on what is considered important enough to be included (e.g. on Wikipedia, what exactly is ‘notable’).
- the description of The Limits of Empire in particular is poor and mostly misses the book’s core argument that there was no Roman ‘grand strategy’ because the Romans were incapable of conceiving of strategy in that way.
- I’m pretty sure from the other responses I have seen (but cannot be 100% confident) that the BMCR, which is open and available to all, was included in ChatGPT’s corpus.
- While we’re here I should note that I think The Limits of Empire is hardly the last word on this question. On why, you want to read E. Wheeler, “Methodological Limits and the Mirage of Roman Strategy” JMH 57.1 and 57.2 (1993); Wheeler systematically destroys nearly all of Isaac’s arguments. I also asked ChatGPT to tell me what Wheeler’s critiques were, but since Wheeler isn’t in its training corpus, it couldn’t tell me. When I asked for a list of Isaac’s most prominent critics, it didn’t list Wheeler because, I suppose, no one in its corpus discussed his article, despite it being (to the best of my knowledge) generally understood that Wheeler’s critique has been the most influential, as for instance noted by J.E. Lendon in this review of the topic for Classical Journal back in 2002. ChatGPT can’t tell you any of that because it can only tell you things other people have already written in its training corpus. Instead, it listed Adrian Goldsworthy, Jeremy Armstrong, John W.I. Lee and Christopher S. Mackay because they all wrote reviews of the book; none of these scholars (some of whom are great scholars) are particularly involved in the Roman strategy debate, so all of these answers are wrong. The latest in this debate is James Lacey’s Rome: Strategy of Empire (2022), which is a solid reiteration of the Luttwakian side of the debate (valuable if only because Luttwak himself is a poor interlocutor in all of this) but seems unlikely to end it. It is possible I am working on trying to say something useful on this topic at some point in the future.
- It also isn’t very good at discoverability. It can’t tell you who or where that better idea is from if you find yourself wanting more explanation or context. Once again, as a research tool, Google is pretty clearly superior.
- This is painfully obvious when it comes to trying to get information about video games. In ye days of yore, Google would swiftly send you to the GameFaqs page (remember those!?) or the helpful fan Wiki, but more recently it becomes necessary to slog through a page or two of overly long (because Google prefers pages with at least a certain amount of text) answers to very simple questions in order to find what you are looking for (which usually ends up being a helpful response to someone’s question on Reddit or a Steam guide or, because I still like to live in 2004, an actual GameFaqs page).
- And thus, dear students, if you are not reading the comments you are not getting what you paid tens of thousands of dollars for when you paid tuition. Read the comments. You are in college to learn things not prove what you already know or how smart you already are. We know you are smart, that’s why you got admitted to college; the question now is about drive and willingness to learn.
- There is thus a meaningful difference between this and the ‘why did I need to learn math without a calculator’ example that gets reused here, in that a calculator can at least do basic math for you, but ChatGPT cannot think for you. That said, I had quite a difficult time learning that sort of thing as a kid, but (with some extra effort from my parents) I did learn it and I’ve found it tremendously useful in life. Being able to calculate a tip in my head or compare the per-unit price of, say, a 3-for-whatever sale on 12-packs of soda vs. a 24-pack of the same brand without having to plug it into my phone is really handy. I thus find myself somewhat confused by folks I run into who are bitter they were forced to learn mathematics first without a calculator.
- A point we have already addressed.
- The one exception here is online courses using ‘closed book’ online essay tests. That is an exam model which will be rendered difficult by this technology. I think the answer there is clever prompt writing (demanding the students do things – be specific in evidence or reference specific works – that ChatGPT is bad at) or alternative assignments (a capstone project or essay instead). For in-person classes, the entire problem is obviated by the written in-class essay.
- And if they don’t, that’s academic dishonesty regardless of who wrote the paper.
- And a student that cannot or will not cite their sources has plagiarized, regardless of who wrote their paper. ChatGPT is such a mess of academic dishonesty that it isn’t even necessary to prove its products were machine-written because the machine also does the sort of things which can get you kicked out of college.
- And if the student has gone back and done the research to be able to correct those errors and rewrite those sentences in advance…at this point why not just write the paper honestly and not risk being thrown out of college?
- In the event, I asked for 8,000 words because I wanted to see how it would handle organizing a larger piece of writing. Now in the free version it can’t write that many words before it runs out of ‘tokens,’ but I wanted to see how the introduction would set up the organization for the bits it wouldn’t get to. In practice it set up an essay in three or four chunks, the first of which was 224 words; ChatGPT doesn’t seem able to even set up a larger and more complex piece of writing. It also doesn’t plan around the number of words it can produce before running out of tokens, in case anyone thinks that’s what it was doing: to get to the end of the essay with all of the components it laid out in the introduction, I had to jog it twice.
- Of course if the student has just tried honestly and failed, they’ll be able to document that process quite easily, with the works they read and where each wrong fact came from, whereas the student who has cheated using ChatGPT will be incapable of doing so.
- a hotly debated topic, actually!