I - Source Code, AI, and Copilot
A. Programming Languages and Source Code
Computer programs, from the calculator app on your phone to the latest big budget video game, begin their life as text files. Such a file contains instructions written in a particular “programming language,” which consists of both a formal grammar specifying syntax rules for instructions and a program that translates the instructions into a form accepted by the actual computer. Most programs are written in relatively human-readable languages whose resulting code is called “source code”; this contrasts with the instructions that devices actually execute, called “machine code” (“assembly code” is a human-readable notation for those same instructions). Source code is eventually translated to machine code that computers can directly execute.
While writing computer programs is made simpler by human-readable programming languages, the task is difficult, and the result of the process – source code – is itself complex. To help write and understand code, many programmers look beyond basic text editors and use an “integrated development environment” (IDE), an application that combines a code editor with other helpful features like syntax highlighting1Syntax highlighting colors different types of tokens in code to more easily identify and distinguish them, and can be thought of as akin to uniquely coloring nouns, verbs, and other parts of speech in human language text. and tools for debugging.2What is an IDE?, RED HAT (Jan. 8, 2019), https://www.redhat.com/en/topics/middleware/what-is-ide. With tools in place, the task of writing a program starts with defining the overall purpose of the program, such as a web browser, a program that searches for text among a computer’s files, or a tool that performs date and timezone operations. From here the task becomes one of organization and mental compartmentalization. Source code for nontrivial programs does not exist in a single lengthy file of sequentially executed lines of code. Code is split up and organized in various ways, and this organization is simultaneously a means and an end; elegant organization of the completed source code is a goal to work toward, but it also helps achieve completion by breaking down the problem into manageable pieces.
One method of organization is to spread code across multiple files, with the code in each file serving some particular function, like image processing or database access.3E.g., Source Code for File cli_parser.py in Apache Airflow Project, GITHUB, https://github.com/apache/airflow/blob/main/airflow/cli/cli_parser.py (last visited Dec. 11, 2021) (listing source code for parsing commands entered into a command line interface, or CLI, in a file named cli_parser.py). The computer program, as embodied in source code, will be distributed across multiple files, which are later brought together by the programming language. Deciding how to divide up the program’s overall functionality into separate files is one aspect of code organization.
Code is also organized in ways enabled by the particular programming language. The simplest and most common such method is the function (also known as a subroutine or procedure). A function is a named sequence of lines of code which can be invoked elsewhere in the code via its name. It may accept input which is then used by its code. Functions in a program are like buttons and levers to be pushed and pulled by other parts of the code to perform various tasks. A simple example would be a function that calculates the average of three numbers. Assume that you need to average three numbers many times in the overall program. In lieu of repeating the actual symbolic mathematical expression, (a + b + c) / 3, throughout your code, an available function would let you replace it with average(a, b, c). Here the length of the code replaced by the function name is tiny, but in practice a function name replaces multiple lines of code. And even in this trivial example, the function name “average” is more easily read by a human reader than the underlying mathematical symbols.
It can be helpful to think of a function as comprising two parts: a declaration and an implementation. This distinction will matter later in the discussion. First, the declaration contains the name of the function and the specification of its inputs, which “declares” how to invoke the function. A function exists to be used, and the declaration tells you how to use it: by typing this name into your code and adhering to the listed input specification. Second, the implementation is the code that actually performs the function’s task. Or to put it another way, the implementation implements the function’s functionality. For example, see the three number average function below written in Python. The line numbers on the left hand side are not part of the code, but are merely there to identify the lines. Line 1 comprises its declaration; it contains a keyword signaling to Python that this is a function (def), the function name (average), and input specification. Lines 2–4 make up its implementation. They calculate and yield the average of the three input numbers.
1 def average(a: int, b: int, c: int):
2     sum = a + b + c
3     result = sum / 3
4     return result
Whereas declaration code does not really do anything, implementation code actually performs tasks and achieves results. Craft is needed to organize code into functions and write declarations, but a different kind of craft is needed to implement. The number of possible implementations grows alongside the complexity of the task being performed, and not all implementations are of equal quality. A correct but slow implementation might be quickly apparent to most programmers, while a better and faster implementation may elude a novice. Various considerations affect implementation code: the programmer’s knowledge and skill; the ultimate hardware environment (e.g. a device with less memory might require more judicious use of resources); ease of future reading by later programmers maintaining the code; and adherence to style conventions4A style convention or style guide for a given programming language is a set of dictates to programmers. They aim to standardize certain aspects of code and make it less idiosyncratic to each programmer. The purpose of style conventions is to help future readers (including the original author) understand the code’s logic by avoiding the mental effort of adjusting to a unique way of writing code. For example, the Python style guide suggests that function names “should be lowercase, with words separated by underscores as necessary to improve readability.” Guido Van Rossum, Barry Warsaw, & Nick Coghlan, PEP 8 -- Style Guide for Python Code, PYTHON.ORG (Aug. 1, 2013), https://www.python.org/dev/peps/pep-0008/#function-and-variable-names (last visited Dec. 11, 2021)..
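To illustrate how two implementations of the same task can differ in quality, consider the following minimal sketch. The task (adding up the whole numbers from 1 to n) and both function names are hypothetical and chosen only for illustration; they do not come from any project discussed in this paper.

```python
def sum_to_n_slow(n: int) -> int:
    """Add the whole numbers 1..n one at a time.

    Correct and easy to think of, but the work grows with n.
    """
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_fast(n: int) -> int:
    """Compute the same sum with the closed-form formula n(n + 1) / 2.

    Takes the same amount of work no matter how large n is.
    """
    return n * (n + 1) // 2
```

Both functions yield the same result for the same input. The difference lies entirely in the implementation: the first is the version most programmers would write immediately, while the second depends on knowing a mathematical shortcut.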
A final organizational tool supported by programming languages is the class or object.5AVINASH C. KAK, PROGRAMMING WITH OBJECTS 29 (2003). With classes, programmers can define “entities” within their program that contain specific attributes and have functions attached to them. For example, a spreadsheet program might have a class that represents a cell in the spreadsheet. The cell class could contain attributes like its background color and current value, and it could have functions attached to it that delete or update its contents. Classes introduce another mental model for organizing one’s code that may better fit how people view the world, because creating and manipulating a “thing” is relatively intuitive.
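The spreadsheet cell described above might be sketched in Python as follows. The class name, attribute names, and default values are assumptions made for illustration and are not drawn from any real spreadsheet program.

```python
class Cell:
    """A single spreadsheet cell (hypothetical example)."""

    def __init__(self, value: str = "", background_color: str = "white") -> None:
        # Attributes: the data each cell carries around with it.
        self.value = value
        self.background_color = background_color

    def update(self, new_value: str) -> None:
        """Replace the cell's current contents."""
        self.value = new_value

    def delete(self) -> None:
        """Clear the cell's contents, leaving its color untouched."""
        self.value = ""


# Create one cell, then manipulate it through its attached functions.
cell = Cell(value="42", background_color="yellow")
cell.update("43")
cell.delete()
```

The final three lines show the "thing"-oriented mental model at work: the programmer creates a cell and then asks that cell to update or delete itself.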
Most programming languages permit programmers to insert non-functional “comments” into source code, and knowing about them will help our discussion later. Comments are typically preceded by a special character or two like // or #, which tells the language to functionally ignore that line. Programmers use comments to insert commentary about the code. Source code is relatively human-readable, but it still does not explain itself to readers very well, and comments help with that. Functions are frequently accompanied by comments called “docstrings” outlining what the function does, explaining how it should be used, and describing any inputs. The existence of comments as a feature indicates that future readers of code are an important audience considered by programmers. Indeed, the official style guide for Python states the common programming mantra that “code is read much more often than it is written.”6Van Rossum, Warsaw, & Coghlan, supra note 4.
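As a brief illustration, the three number average function might carry both kinds of commentary. The wording of the docstring and the comment below is invented for this sketch.

```python
def average(a: float, b: float, c: float) -> float:
    """Return the arithmetic mean of three numbers.

    a, b, c: the numbers to average.
    """
    # Add the inputs first, then divide by how many there are.
    total = a + b + c
    return total / 3
```

The text in triple quotes is the docstring, and the line beginning with # is an ordinary comment. Neither affects what the function calculates. In Python specifically, a docstring is stored alongside the function so that tools can display it to programmers, whereas an ordinary comment is simply discarded.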
Many computer programs are built collaboratively. A project can get so large and complex that multiple programmers are tasked with its implementation and maintenance. Software is often collaborative in a more indirect manner, too. A programmer may incorporate code written by another person into their own project because it serves a particular need of theirs, and it would be inefficient to write their own similar code of equal quality. As a result, programs are often composed of a mix of program-specific code and other people’s code. A stark example of this is cryptography code that secures digital communications. The mathematics needed to build secure cryptographic algorithms is so exacting that programmers are told to never “roll your own crypto,” lest they make a mistake and leave communications vulnerable to attackers.7JENNY BLESSING ET AL., YOU REALLY SHOULDN’T ROLL YOUR OWN CRYPTO 1 (2021). Instead, a program needing cryptography will incorporate code written by domain experts, obviating the need for the programmer to build that expertise themselves. For lawyers, an analogous practice would be using a tried and proven contract clause rather than drafting one from scratch.
Often this “other” source code is made publicly available for use by others and is dubbed “open source” software (OSS). OSS is an important part of the software industry. The two most common web server programs8A web server is the program that accepts HTTP (the protocol underlying the web) requests from your device and responds with the content you requested, like a website., accounting for 59% of the market, are both open source.9November 2021 Web Server Survey, NETCRAFT (Nov. 23, 2021), https://news.netcraft.com/archives/2021/11/23/november-2021-web-server-survey.html (last visited Dec. 11, 2021) (finding that web servers nginx and Apache have 35% and 24%, respectively). Software building blocks like the popular programming language Python are, too.10Source Code Repository for Standard Python Implementation CPython, GITHUB, https://github.com/python/cpython (last visited Dec. 11, 2021). Open source operating systems power over 78% of websites.11Usage Statistics of Operating Systems for Websites, W3Techs, https://w3techs.com/technologies/overview/operating_system (last visited Dec. 11, 2021) (finding that Unix powers over 78% of websites). OSS is commonly made available subject to an open source license that imposes conditions on the project’s users. These licenses will be discussed later.
B. Training and Machine Learning
To enable this paper’s later discussions, a brief foray into machine learning (ML) is required. ML is seen as a subfield within artificial intelligence (AI),12I will use the terms ML and AI interchangeably. and its aim is to make “machines get better at some task by learning from data, instead of having to explicitly code rules.”13Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow ch. 1 (2019). At a high level, an ML system is like a function: it takes in some input and returns an output. A common example is an email spam filter, which takes as input information about a particular email and returns a prediction about whether it is spam.14Id. What distinguishes ML systems is that they are developed by a process called training (or learning). Some preliminary information is needed to understand that process.
A developer15“Software developer,” “developer,” and “software engineer” are all common terms roughly synonymous with programmer. of an ML system seeks to build a high-performing model for the task at hand. An ML model is like any other mathematical or scientific model, in that it seeks to explain and predict empirical phenomena. Think of scientists formulating the atomic model, where matter is thought to be made of atoms, themselves made of a cluster of neutrons and protons surrounded by electrons.16See Rhett Allain, The Development of the Atomic Model, WIRED (Sep. 4, 2009, 10:59 PM), https://www.wired.com/2009/09/the-development-of-the-atomic-model/ (last visited Dec. 11, 2021). This model was arrived at after reviewing experimental data and attempting to find a tidy explanation.17See id. A good model is one that, although based on past empirical data, will be able to generalize and accurately predict future observations. The idea of an ML model is similar.
As an example, imagine we are predicting a home’s price given its square footage in a particular city. A simple model for trying to predict the price is a particular straight line on a two-dimensional graph, with square footage on the x-axis and price on the y-axis. The line attempts to accurately capture the relationship between square footage and price, or the rate at which price increases alongside square footage. To predict the price of a home, we find the point on the line corresponding to its square footage and then reference the associated price. The model, which is simply a line, can be distilled to two numeric configuration variables, or parameters: the line’s slope and its y-intercept.18A line with slope m and y-intercept b can be represented entirely by the equation y = mx + b. In our example, x would be the input square footage value, and the result y would be the output price. The values of the parameters determine the line’s form, which is tantamount to determining the model’s accuracy. Different parameter values will yield a different model and different price predictions. If the line’s slope is too steep, then a tiny increase in square footage results in a larger price increase than is found in real homes. The task of creating a useful model reduces to finding good values for the model’s parameters.
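The straight-line model can be written out as a short function. This is a minimal sketch: the function name and every number below are hypothetical and are not taken from any real housing data.

```python
def predict_price(square_feet: float, slope: float, intercept: float) -> float:
    """Apply the line y = m*x + b, where x is square footage and y is price."""
    return slope * square_feet + intercept


# Two hypothetical parameter settings for the same 1,500 square foot home.
reasonable = predict_price(1500, slope=200.0, intercept=50_000.0)  # 350000.0
too_steep = predict_price(1500, slope=400.0, intercept=50_000.0)   # 650000.0
```

The function itself never changes. Only the two parameter values differ, and that difference alone moves the predicted price from $350,000 to $650,000.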
There are many models available for specific problems in ML, and they are more complex than a straight line.19See Géron, supra note 13 at ch. 1 (listing k-nearest neighbors, linear regression, and neural networks as examples of supervised learning models); Terence Shin, All Machine Learning Models Explained in 6 Minutes, TOWARDS DATA SCIENCE (Jan. 5, 2020), https://towardsdatascience.com/all-machine-learning-models-explained-in-6-minutes-9fe30ff6776a (last visited Dec. 11, 2021) (listing example models grouped by high-level tasks like classification and regression). A model can have millions of parameters.20See WILLIAM FEDUS ET AL., SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY 1 (2021) (describing a model with a trillion parameters). Each has a generic structure, and the goal in ML is to fit the chosen model to a particular task by finding parameter values that lead to good results. Training is the process of determining those values.21See Descending into ML: Training and Loss, GOOGLE: MACHINE LEARNING CRASH COURSE, https://developers.google.com/machine-learning/crash-course/descending-into-ml/training-and-loss (last visited Dec. 12, 2021). Training aims to systematically use the information found in many data examples, collectively called training data, to fine-tune the parameters such that the resulting model fits the training data and can generalize to future examples.22See id. In our home price example, training data would consist of many actual square footage-price pairs, and our goal would be to find a line that best fits that data. The point to take away from this discussion is that ML systems are built by training a model, which entails processing many examples of data. Real life systems are much more complex than our toy example.
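For the home price example, training can be sketched in a few lines. The sketch below uses ordinary least squares, a standard textbook method for fitting a line; it is offered only to make the idea concrete and is not a description of how Codex or any other large model is trained. The function name and the four data points are invented for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Find the slope and intercept of the best-fitting line (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: square footage paired with sale price.
square_feet = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [250_000.0, 350_000.0, 450_000.0, 550_000.0]

slope, intercept = fit_line(square_feet, prices)
```

The training data goes in and two parameter values come out. Once those values are known, predictions can be made without consulting the training data again. Models with millions of parameters cannot be solved with a formula like this one and are generally trained by repeatedly making small adjustments to the parameters, but the relationship between training data and parameters is the same.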
In particular, by processing large numbers of training examples, some “AI systems can learn patterns inherent in human-generated data and then use those patterns to synthesize similar data which yield increasingly compelling novel media.”23CULLEN O’KEEFE ET AL., COMMENT REGARDING REQUEST FOR COMMENTS ON INTELLECTUAL PROPERTY PROTECTION FOR ARTIFICIAL INTELLIGENCE INNOVATION 2 [hereinafter OpenAI comment].
C. Putting it Together: Introducing Copilot
In June 2021, software company GitHub announced a technical preview of a new product called Copilot.24Nat Friedman, Introducing GitHub Copilot, THE GITHUB BLOG (Jun. 29, 2021), https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer/ (last visited Dec. 12, 2021). Dubbed “your AI pair programmer,”25Pair programming is a practice where two programmers write code at one computer, with one wielding the keyboard and the other making suggestions, spotting errors, acting as a sounding board, etc. See Jason Garber, Practical Pair Programming ch. 1 (2020). Copilot is an autocomplete tool for writing source code. When installed in a programmer’s IDE26Supra note 2., it will process the file being worked on and suggest lines of source code. The tool processes both source code and human language found in comments, and uses both to make suggestions.
Copilot’s product page provides demonstrations of its use. In one, a programmer is working in a file with some preexisting code that defines an object27See supra note 5 and accompanying text. representing a particular execution of a program, called a “run.” The run object contains attributes, such as whether it failed and its duration in milliseconds (called its “runtime”). The programmer in this example writes a comment explaining a new function’s purpose (a docstring), “Get average runtime of successful runs in seconds,” followed by the start of a function declaration, “func averageRunTimeInSeconds.”28GitHub Copilot, GITHUB, https://copilot.github.com/ [hereinafter Copilot product page] (last visited Dec. 12, 2021). The purpose of the function is clear to a programmer: it will calculate and yield the average runtime of all the successful runs in some collection of runs. With the human language description and function name in hand, Copilot suggests the function’s input parameters (i.e. the names and data types of each input), the type of data the function will yield (a decimal number), and a twelve line implementation. The suggested implementation processes each input run to check if it succeeded by referencing its failure attribute, all the while accumulating a sum of the successful runs’ runtimes and the number of successful runs.29Id. After processing them all, the average runtime is calculated and yielded by the function.30Id.
Even this simple example reveals several enticing attributes of the tool. Copilot understood the programmer’s purpose as reflected in their English language docstring and function name, and generated code accomplishing that purpose. Copilot’s suggestion incorporated the programmer-supplied run object. Copilot was aware of the object’s existence and used it as a function input; it was also aware of the run’s failure and runtime attributes, and accessed them to perform the calculation. In addition, a programmer-supplied comment stated that the unit of time for the object’s runtime duration attribute was milliseconds, whereas the docstring and function name referred to the average runtime in seconds. Copilot’s implementation first calculated the average in milliseconds and then divided that result by one thousand. The implementation of this particular code is straightforward for most programmers, but that is part of Copilot’s appeal: it can save programmers from mundane work.
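For readers who want to see the logic just described, the following Python sketch reconstructs it from the prose above. It is not Copilot's actual suggestion, which appeared in a different programming language on the product page. The attribute names and the handling of a collection with no successful runs are assumptions added for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    """One execution of a program (attribute names are hypothetical)."""
    failed: bool
    runtime_ms: float  # duration in milliseconds


def average_runtime_in_seconds(runs: List[Run]) -> float:
    """Get average runtime of successful runs in seconds."""
    total_ms = 0.0
    successful = 0
    for run in runs:
        if not run.failed:
            total_ms += run.runtime_ms
            successful += 1
    # Assumption for this sketch: report 0.0 when no run succeeded,
    # which avoids dividing by zero.
    if successful == 0:
        return 0.0
    # Average in milliseconds first, then convert to seconds.
    return (total_ms / successful) / 1000
```

The last line performs the unit conversion discussed above: the runtimes are stored in milliseconds, so the average is divided by one thousand to yield seconds.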
Copilot uses its own production version of an AI model called Codex, built by a company named OpenAI.31Id. Codex was developed by copying and processing many training examples, as discussed above. According to the Copilot product page, its training data consisted of “a selection of English language and source code from publicly available sources, including code in public repositories on GitHub.”32Id. This suggests some of the training data came from sources other than GitHub.33This may not be true; a paper by OpenAI evaluating the correctness of Python code generated by Codex only mentions GitHub as a source of training data. It may be that the productionized version of Codex powering Copilot was trained on a distinct training set, or that the paper was imprecise in describing the training data. MARK CHEN ET AL., EVALUATING LARGE LANGUAGE MODELS TRAINED ON CODE 4 (2021). Later, this paper will outline how use of these training examples by Copilot may present copyright liability for users of the product or GitHub itself.
II - Code, Copyright, and the Concern with Copilot
A. Code and Copyright
To appreciate the potential for copyright liability presented by Copilot, it is helpful to understand copyright policy and computer programs’ unique place within the regime. The Constitution grants Congress the power “[t]o promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.”34U.S. CONST. art. I, § 8, cl. 8. Courts interpret the purpose of copyright pragmatically, as inducing authors to produce and disseminate works with the temporary grant of valuable exclusive rights.35See e.g. Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1195 (2021); Mazer v. Stein, 347 U.S. 201 at 219 (1954) (“The economic philosophy behind the clause . . . is the conviction that encouragement of individual effort by personal gain is the best way to advance public welfare through the talents of authors.”). The reward to authors is subsidiary to the ultimate goal of giving the public access to the authors’ work for use and enjoyment.36See Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 429 (1984) (“The monopoly privileges . . . [are not] primarily designed to provide a special private benefit. Rather, the limited grant is a means by which an important public purpose may be achieved.”); Authors Guild v. Google, Inc., 804 F.3d 202, 212 (2d Cir. 2015) (“[W]hile authors are undoubtedly important intended beneficiaries of copyright, the ultimate, primary intended beneficiary is the public, whose access to knowledge copyright seeks to advance by providing rewards for authorship.”). The Copyright Act is the embodiment of this bargain with respect to expressive works eligible for copyright protection; it is the particular balance struck by Congress “[t]o promote the progress of science and useful arts.”37U.S. CONST. art. I, § 8, cl. 8.
Copyright protection is available to computer programs.38In 1974, Congress established a commission to study copyright law’s interaction with new technologies. See Pub.L. No. 93–573, § 201, 88 Stat. 1873 (1974). Its final report recommended that the Copyright Act of 1976 (the copyright statute currently in force) be amended to make clear that copyright covers computer programs. National Commission on New Technological Uses of Copyrighted Works, Final Report 1 (1978). The Copyright Act was subsequently amended to incorporate the suggestion. See Pub.L. No. 96–517, § 10(a), 94 Stat. 3028 (1980). Section 102(a) of the Copyright Act states the two threshold requirements for obtaining copyright. It says that copyright subsists in “original works of authorship” that are “fixed in any tangible medium of expression.”3917 U.S.C. § 102(a). A computer program of a “minimal degree of creativity” satisfies the originality requirement.40See Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 345 (1991) (“[T]he requisite level of creativity is extremely low; even a slight amount will suffice.”). Saving a program to a computer’s hard disk or memory satisfies fixation.41See MAI Sys. Corp. v. Peak Computer, Inc., 991 F.2d 511, 518 (9th Cir. 1993). Section 102(a) then lists categories of works that count as “works of authorship,” including “literary works”.4217 U.S.C. § 102(a). Computer programs are “literary works” per the statute because they consist of a sequence of letters, numbers, and symbols; at that abstract level, they are akin to books.431 Melville B. Nimmer and David Nimmer, Nimmer on Copyright § 2A.10[B] (Matthew Bender, Rev. Ed. 2021) [hereinafter Nimmer on Copyright] (“It is . . . firmly established that computer programs qualify as work of authorship in the form of literary works, subject to full copyright protection.”); see also Computer Assocs. Int'l, Inc. v. Altai, Inc., 982 F.2d 693, 702 (2d Cir. 1992) (“[T]he legislative history leaves no doubt that Congress intended [computer programs] to be considered literary works.”) (citation omitted).
Notwithstanding its plain inclusion in copyright, code44When discussing computer programs in the context of copyright, I will use “computer program,” “source code,” and “code” interchangeably. sits awkwardly within the regime because of its utilitarian nature. Copyright extends only to an author’s particular expression of an idea, and not the idea itself.45See Baker v. Selden, 101 U.S. 99, 104 (1879). This notion is dubbed the idea-expression dichotomy. Thus, a programmer cannot copyright the idea of sorting a list of numbers, but they may be able to copyright their particular expression of that idea: an original sorting algorithm. Without this barrier in place, copyright law could give an author long-lasting exclusive control over an idea.46Id. at 105 (“The description of the art in a book, though entitled to the benefit of copyright, lays no foundation for an exclusive claim to the art itself. The object of the one is explanation; the object of the other is use. The former may be secured by copyright. The latter can only be secured, if it can be secured at all, by letters-patent.”). The idea-expression dichotomy resides in section 102(b) of the Copyright Act, which says that, “[i]n no case does copyright protection . . . extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.”4717 U.S.C. § 102(b). Because the idea of code—the general process or procedure performed when it executes—is to some extent intertwined with code’s expression in source code, section 102(b) withholds protection for some elements of code. Put another way, insofar as code is a “procedure, process, system, [or] method of operation,” it is not copyrightable. All code arguably fits into the excluded categories. 
In his thoughtful Lotus concurrence, Judge Boudin commented that “if taken literally [section 102(b)] might easily seem to exclude most computer programs from protection.”48Lotus Dev. Corp. v. Borland Int'l, Inc., 49 F.3d 807, 820 (1st Cir. 1995), aff'd, 516 U.S. 233 (1996) (BOUDIN, J., concurring). Tellingly, as mentioned above, another name for a function is a procedure, and there is a class of programming languages called procedural programming languages. For these reasons, Boudin quipped that “[a]pplying copyright law to computer programs is like assembling a jigsaw puzzle whose pieces do not quite fit.”49Id.
B. Copilot and Copyright
Copilot presents copyright issues because it is an AI system developed by copying and processing many training examples. This paper will focus on the copyright implications stemming from Copilot’s source code training data, not the English language training data.50The English language training data is outside of the scope of this paper for two reasons. First, the source and nature of the English language training data is unclear. Second, the differing natures of code and English language works would require two branches of analysis, and the author was more interested in analyzing code. The main business of Copilot creator GitHub involves hosting users’ code in “repositories,” and users make much of that code publicly available.51In contrast, a repository can be set to private, with its contents only accessible by those to whom the owner grants permission. See Setting Repository Visibility, GITHUB DOCS, https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/managing-repository-settings/setting-repository-visibility (last visited Dec. 16, 2021). The source code training data can be thought of as comprising two buckets: code from public GitHub repositories, and code from non-GitHub sources. The former is perhaps less likely to generate an actionable copyright violation by GitHub because its terms of service state that users grant the company a license to “make incidental copies [of user code], as necessary to . . . improv[e] the Service over time.”52GitHub Terms of Service, GITHUB (Nov. 16, 2020), https://docs.github.com/en/github/site-policy/github-terms-of-service#4-license-grant-to-us (last visited Dec. 12, 2021). In any action based on GitHub’s use of user source code, the company can argue that creating a new product like Copilot constitutes “improving the Service over time.”53Id. If the terms of service do not cover such use by GitHub, however, then the first bucket of training data would present potential copyright liability. 
The second bucket, i.e. the non-GitHub training data, seems likelier to present a copyright issue, because the authors of those examples have not granted GitHub a license to use their code.
The creation of Copilot itself may constitute copyright infringement by GitHub.54I will assume that GitHub gathered together the training data for its production version of Codex and performed the training. Another actor, such as OpenAI, may have actually gathered the training data and trained the model before delivering it to GitHub, in which case the analysis undertaken here can be directed toward that other actor. Even though the code in the training data is publicly available, it is still protected by copyright. Publicly available code is often subject to a particular open source license.55See e.g. License File for Kubernetes Project, GITHUB, https://github.com/kubernetes/kubernetes/blob/master/LICENSE (last visited Dec. 12, 2021) (using Apache License); License File for iTerm2 Project, GITLAB, https://gitlab.com/gnachman/iterm2/-/blob/master/LICENSE (last visited Dec. 12, 2021) (using GNU General Public License); License File for libstdc++ Project, GNU PROJECT, https://gcc.gnu.org/onlinedocs/libstdc++/manual/license.html (last visited Dec. 12, 2021) (using GNU General Public License). Even open source code not clearly subject to a license is still protected; it is simply being made available to read. Open source licenses “tell the world the conditions under which they can and can’t use an open source project.”56The Developer’s Guide to Open Source Software Licenses, FOSSA, https://fossa.com/developers-guide-open-source-software-licenses (last visited Dec. 12, 2021). At least one court has held that a violation of the terms of an open source license is actionable as copyright infringement, rather than only as a breach of contract.57See Jacobsen v. Katzer, 535 F.3d 1373, 1382–83 (Fed. Cir. 2008). If Copilot’s use of a training example violates an example’s license, there is a plausible legal pathway for that use to be copyright infringement.
One popular license with exacting conditions on licensees is the GNU General Public License (GPL), self-described as “a free, copyleft58Copyleft refers to using the exclusive rights given to copyright owners to ensure that “anyone who redistributes the software, with or without changes, must pass along the freedom to further copy and change it.” What is Copyleft?, GNU PROJECT, https://www.gnu.org/licenses/copyleft.en.html (last visited Dec. 12, 2021). license for software and other kinds of works.”59The GNU General Public License, GNU PROJECT, https://www.gnu.org/licenses/gpl-3.0.en.html [hereinafter GPL] (last visited Dec. 12, 2021). In 2015, around twenty percent of repositories on GitHub were clearly licensed.60Ben Balter, Open Source License Usage on GitHub.com, THE GITHUB BLOG (Mar. 9, 2015), https://github.blog/2015-03-09-open-source-license-usage-on-github-com/ (last visited Dec. 12, 2021). Of those, over twenty percent used the GPL.61Id. GitHub’s use of code licensed under the GPL to train Copilot may violate the GPL’s terms. The GPL imposes conditions on certain uses of any work “based on” a licensed work, including works that “adapt all or part of the work in a fashion requiring copyright permission.”62GPL, supra note 59, § 0. Copilot is arguably based on the training examples because it copies and processes them to fine-tune its parameters, and thus adapts the training examples in a fashion requiring copyright permission.63If this is true, then GitHub also accepted the license. Id. § 9. If Copilot is based on a GPL licensed training example, GitHub is obligated to make available the Copilot source code.64Id. § 6. Since Copilot is not open source, GitHub is not adhering to that condition and may be liable for infringement as a result.
The suggestions made by Copilot may present copyright issues, too. According to the company’s own research, the tool’s “suggestion may contain some snippets that are verbatim from the training set” “about 0.1% of the time.”65Copilot product page, supra note 28. This means that code from a training example might be given verbatim as a suggestion to a Copilot user. A tenth of a percent is a small percentage, but means that one thousand suggestions per million could be verbatim copies. In addition, a suggestion may not be a verbatim copy but still be similar enough to a training example to be considered a legal copy or derivative work.66For example, translating an original computer program to a different programming language may count as creating a derivative work. See e.g. Tradescape.com v. Shivaram, 77 F. Supp. 2d 408, 413 (S.D.N.Y. 1999). Copyright owners have the exclusive right to prepare derivative works from works they own. 17 U.S.C. § 106(2).
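The arithmetic behind that figure can be sketched in a few lines of Java. This is an illustration only: the 0.1% rate is the figure GitHub reports, while the class name, the method name, and the simplifying assumption that the rate applies uniformly to every suggestion are hypothetical choices made for this sketch.

```java
// Illustrative arithmetic only; VerbatimEstimate and expectedVerbatim are hypothetical names.
class VerbatimEstimate {
    // Rate reported by GitHub: about 0.1% of suggestions may contain verbatim snippets.
    static final double VERBATIM_RATE = 0.001;

    // Expected number of suggestions containing verbatim snippets,
    // assuming (for simplicity) that the rate applies uniformly to every suggestion.
    static long expectedVerbatim(long suggestions) {
        return Math.round(suggestions * VERBATIM_RATE);
    }

    public static void main(String[] args) {
        // One million suggestions at a 0.1% rate.
        System.out.println(expectedVerbatim(1_000_000L));
    }
}
```

Under that assumption, the expected count scales linearly with the number of suggestions made, so a small rate can still correspond to a large absolute number of verbatim suggestions for a widely used tool.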
To summarize, both Copilot’s creation and its suggestions present plausible prima facie cases of copyright infringement of the training examples. The author of a training example has a copyright in that work if it is sufficiently original, because code is copyrightable and it is fixed within the machine on which it is hosted.67Supra, § II.A. Some of the training examples are likely made available under open source licenses, the violation of which may be actionable as copyright infringement. GitHub did not obtain authorization from the examples’ authors, and Copilot’s release may violate those licenses. A suggestion by the tool of verbatim code from a training example constitutes a literal copy of a portion of that work, albeit an improbable and indirect copy. Each of those acts presents at least potential liability for infringing the copyright in the training examples, which carries statutory damages of up to $30,000 per work infringed (or up to $150,000 per work infringed if the infringement is found to be “willful”).6817 U.S.C. § 504(c)(1)–(2). Whether these acts are likely to result in liability is the subject of the rest of this paper.
III - Assessing the Infringements
A. Copying to Train Copilot
We laid out above that training Copilot presents a prima facie case of copyright infringement. Our analysis does not end there, because the exclusive rights given to copyright owners by the Copyright Act are explicitly subject to statutory limitations. A copyright owner’s exclusive rights, including the reproduction right, are “[s]ubject to sections 107 through 122.”6917 U.S.C. § 106. A complete assessment of GitHub’s liability must check that their use does not fall within the list of limitations.
In particular, GitHub may escape liability by arguing that the training of Copilot is a fair use of the training examples. Section 107 of the Copyright Act states that in spite of the exclusive rights granted to copyright owners in section 106, “the fair use of a copyrighted work, including such use by reproduction in copies . . . for purposes such as criticism, comment, news reporting, teaching . . . scholarship, or research, is not an infringement of copyright.”7017 U.S.C. § 107 (emphasis added). One who engages in a fair use of a copyrighted work is not liable for copyright infringement. Note the words “such as,” indicating that the list of public-minded uses like teaching and commentary is merely illustrative.71Id.
Section 107 goes on to provide guidance to those judging whether a particular use is fair by laying out four factors to be considered. These factors will be applied to Copilot later. Section 107 introduces the four factors by stating that “the factors to be considered shall include.”72Id. (emphasis added). Per the language, fair use analyses by courts must consider these factors (shall), but they are nonexclusive and courts may consider other factors (include). The four fair use factors grant helpful structure to fair use analyses, but they are not an algorithm for formulaically deciding whether a use is fair. To the contrary, fair use doctrine is flexible.73See e.g. Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 577 (1994) (noting that the fair use inquiry is not “simplified with bright-line rules, for the statute, like the doctrine it recognizes, calls for case-by-case analysis.”); Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1197 (2021) (stating that fair use “is flexible . . . and . . . its application may well vary depending upon context.”); Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 448 (1984) (describing fair use as an “equitable rule of reason.”). It “permits courts to avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster.”74Stewart v. Abend, 495 U.S. 207, 236 (1990) (internal quotations omitted).
1. Transformativeness and Exact Copies
Since the Supreme Court’s decision in Campbell, fair use analysis has often been guided by the inquiry into whether the use of the plaintiff’s work is “transformative.”75See Campbell, 510 U.S. at 579; Mark A. Lemley & Bryan Casey, Fair Learning, 99 TEX. L. REV. 743, 782–83 (2021) (“Transformative use has arguably swallowed fair use doctrine in the past twenty-five years.”); Rebecca Tushnet, Content, Purpose, or Both?, 90 WASH. L. REV. 869 (2015) (“Transformativeness has indeed become almost synonymous with fairness . . . .”). This inquiry is couched within the first fair use factor, “the purpose and character of the use.”7617 U.S.C. § 107(1). It asks whether the defendant’s use “adds something new, with a further purpose or different character, altering the first with new expression, meaning, or message.”77Campbell, 510 U.S. at 579. The Court’s justification for this view is that a work which “adds something new and important” advances copyright’s constitutional goal of promoting the arts and sciences.78Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1203 (2021). See Campbell, 510 U.S. at 579. In Campbell, the concept of transformativeness debuted in the context of parody.79Campbell, 510 U.S. at 571–72. Unspooling the concept from there leads naturally to an idea of content-transformativeness, where a later user incorporates a copyrighted work into their own work and makes an equal claim as an author.80See Tushnet, supra note 75, at 882. Indeed, courts have found transformative a visual artist’s inclusion of an existing work into their own.81E.g. Blanch v. Koons, 467 F.3d 244, 259 (2d Cir. 2006).
Less expectedly, courts have also come to recognize transformative purpose, where a defendant uses an exact copy of a work to accomplish an objective distinct from and orthogonal to the work’s original objective.82See Tushnet, supra note 75, at 869, 878 (describing purpose-transformativeness as alternatively where “a work is reproduced wholesale or nearly so, but in a different context,” and where “a defendant has a different interpretive or communicative project than the plaintiff did in creating the original work.”). One flavor of purpose-transformativeness is where a defendant uses many copyrighted works to create a tool, product, or service that would not otherwise exist, and whose purpose is additive to or different from that of the employed works.83Id. at 877–78. Thus a plagiarism detection service is transformative because it uses submitted papers not for their expressive content, but for detecting and deterring plagiarism.84See A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630, 640 (4th Cir. 2009). As with content-transformative uses like parody, there must be a “justification for the taking” of the allegedly infringed work.85Authors Guild v. Google, Inc., 804 F.3d 202, 215 (2d Cir. 2015). Later we will discuss transformativeness as applied to Copilot.
Courts have accommodated new technological uses of copyrighted works that would otherwise be infringement by employing the idea of transformative purpose.86See e.g. Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510, 1527 (9th Cir. 1992), as amended (Jan. 6, 1993) (“We are not unaware of the fact that to those used to considering copyright issues in more traditional contexts, our result may seem incongruous at first blush. . . . However, the key to this case is that we are dealing with computer software, a relatively unexplored area in the world of copyright law.”); Oracle, 141 S. Ct. at 1197 (stating that the “application of [fair use] requires judicial balancing, depending upon relevant circumstances, including significant changes in technology.”) (quoting Sony, 464 U.S. at 430). In Sega Enterprises Ltd. v. Accolade, Inc., the defendant copied the source code of three of the plaintiff’s video games to gain access to unprotectable parts of the games’ code.87Sega, 977 F.2d at 1518. The plaintiff alleged that the intermediate copies made in the reverse engineering process constituted infringement.88Id. at 1516. The court held that the use was fair for two reasons.89Id. at 1520. First, the defendant’s ultimate use, to gain access to unprotectable functional elements of the works, is legitimate under copyright.90Id. at 1522–23. Second, the copying was done to achieve that legitimate purpose, and was the only means of doing so.91Id. at 1526. Sega predates Campbell and lacks the language of transformativeness. But it is consistent with the idea of permitting exact copies where the purpose of copying sidesteps a work’s ordinary, expressive use. Sega stands for the proposition that an ostensibly infringing act which is part of a larger “non-expressive” use may be a fair use.92Benjamin L. W. Sobel, Artificial Intelligence's Fair Use Crisis, 41 COLUM. J.L. & ARTS 45, 52 (2017) (citing James Grimmelmann, Copyright for Literate Robots, 101 IOWA L. REV. 657, 662 (2016)). 
As will be seen later, if Copilot’s use of the training examples can similarly be said to be orthogonal to the examples’ ordinary use, then copying in service of that use may be a fair use.
Some transformative purpose cases look more like Copilot because the defendant used many copyrighted works. The defendant in Kelly v. Arriba Soft Corp. operated an Internet search engine that displayed results in the form of reduced-size thumbnail versions of images it had discovered while scanning the web.93Kelly v. Arriba Soft Corp., 336 F.3d 811, 815 (9th Cir. 2003). The Ninth Circuit held that the defendant’s inclusion of the plaintiff’s copyrighted photographs in the search engine results was transformative and a fair use.94Id. at 822. The search engine’s use of the works, “to help index and improve access to images on the internet,” was orthogonal to the works’ original use of “engag[ing] the viewer in an aesthetic experience.”95Id. at 818. The Ninth Circuit echoed their own reasoning in a later case involving Google’s image search engine and copyrighted photographs, noting that Google put the original works to “an entirely new use” distinct from the photographs’ original “entertainment, aesthetic, or informative function.”96Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146, 1165 (9th Cir. 2007).
In a later case, Google copied millions of physical books to create the Google Books service, which had both a search and snippet function.97Authors Guild v. Google, Inc., 804 F.3d 202, 208–10 (2d Cir. 2015). In response to a user query, the search function returned a list of books in which the query terms appear, plus some other information about each book.98Id. at 209. The snippet function would reveal up to three horizontal sections of a page (“snippets”) per book containing a query term.99Id. at 209–10. The plaintiffs were copyright owners whose works appeared on the service.100Id. at 208. Google’s copying to achieve both the search and snippet functions was held a fair use.101Id. at 225. The Second Circuit found that Google’s purpose of “provid[ing] otherwise unavailable information about the originals” was “highly transformative.”102Id. at 215–216. The context provided by the snippet function was seen as sufficiently important to achieving the search’s transformative purpose to be brought beneath its fair use umbrella.103Authors Guild v. Google, Inc., 804 F.3d 202, 218 (2d Cir. 2015). This case closely resembles Copilot’s situation: both Google Books and Copilot use many copyrighted works and communicate small portions of them to the public verbatim, although Copilot does so less frequently.
In 2021, the Supreme Court revisited fair use and decided another relevant case, with Google once again sitting in as defendant.104Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1183 (2021). Oracle, the plaintiff, owned the copyright in a program called Java SE that enabled programs written in the Java programming language to run on any laptop or desktop computer.105Id. at 1190. The Java SE platform provided a set of pre-written classes and functions106The opinion correctly uses the term “method” to refer to a function associated with a class. To keep things simple, I am continuing to use the word “function.” The only extra information one gets when learning that something referred to is a method rather than a function is that it is associated with a class. which programmers could incorporate into their own custom programs.107Oracle, 141 S. Ct. at 1191. Similar classes were grouped into an organizational unit called a “package,” and together these packages, classes, and functions comprised the Java SE application programming interface (API).108Id. Two examples in the latest version109Version 17, as of this writing. of the Java SE API are a class for dealing with files110Documentation for File Class in Java SE, ORACLE: JAVA SE DOCUMENTATION, https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/io/File.html (last visited Dec. 12, 2021). and a function for generating a random number in a class named “Math.”111Documentation for Function Random in Math Class of Java SE, ORACLE: JAVA SE DOCUMENTATION, https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/Math.html#random() (last visited Dec. 12, 2021).
The Java API reflects a point mentioned in our earlier discussion about code.112Supra, § I.A. Just as we distinguished between a function’s declaration and implementation, the Court conceived of the Java API as comprising distinct declaring and implementing code.113Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1191–92 (2021). The declaring code comprised the function names and input specifications, as well as a function’s “placement within a particular class and the placement of a class within a particular package.”114Id. at 1192. As the Court explained, the declaring code provided the name that a programmer would invoke to trigger implementing code, and also served “an organizational function” by reflecting “how we want [the] tasks arranged and grouped.”115Id. Importantly, the Court recognized the “I” in API by noting that “one can think of the declaring code as part of an interface between human beings and a machine.”116Id. A question and answer help illustrate this point. How would a programmer use the Java SE system to determine which of two numbers is larger, a functionality Java SE provides? They would call the API’s appropriate function using the name specified in the declaring code: java.lang.Math.max(x, y).117Id. at 1193. The declaring code specified how to access and use Java SE’s functionality.
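The call described above can be sketched as a short custom program. The class name ApiCallDemo and the method larger are hypothetical names chosen for illustration; only java.lang.Math.max is part of the Java SE API.

```java
// A minimal sketch of a programmer's custom program; ApiCallDemo and larger are hypothetical names.
class ApiCallDemo {
    // Returns the larger of two numbers by invoking Java SE's pre-written function.
    static int larger(int x, int y) {
        // Package java.lang -> class Math -> function max:
        // this fully qualified name and its inputs are what the declaring code specifies.
        return java.lang.Math.max(x, y);
    }

    public static void main(String[] args) {
        System.out.println(larger(4, 9));
    }
}
```

The programmer never sees or writes the implementing code that actually compares the two numbers. The name, its inputs, and its place in the package and class hierarchy are all the programmer needs to know.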
After acquiring a startup firm named Android, Google began work on the mobile phone operating system of the same name.118Id. at 1190. Google wanted to make it easier for existing Java programmers to write applications for Android, which would have its own API.119Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1192–93 (2021). To achieve this, Google copied the declaring code (the package names, class specifications, and function declarations) from thirty-seven packages of the Java API.120Id. at 1193. This copying aided Google in its goal because Java programmers could “rely on the [function] calls that they are already familiar with” while programming on Android.121Id. at 1194. “Without that copying, programmers would need to learn an entirely new system to call up the same tasks.”122Id. Google wrote its own implementing code, as well as the declaring code for the remaining Android API packages, which outnumbered those copied.123Id.
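The difference between reusing declaring code and writing new implementing code can be sketched as follows. PlatformA and PlatformB are hypothetical stand-ins for two platforms. Neither is the actual Java SE or Android source code, and the function bodies are invented for illustration.

```java
// Hypothetical illustration; PlatformA and PlatformB are stand-ins for two platforms
// and do not reproduce the actual Java SE or Android source code.
class PlatformA {
    static class Math {
        // Declaring code: the function's name, its inputs, and its placement within a class.
        static int max(int x, int y) {
            // Implementing code: one way of carrying out the task.
            return (x >= y) ? x : y;
        }
    }
}

class PlatformB {
    static class Math {
        // Same declaring code as PlatformA.Math.max ...
        static int max(int x, int y) {
            // ... but independently written implementing code.
            if (x < y) {
                return y;
            }
            return x;
        }
    }
}
```

A programmer who already knows to write Math.max(x, y) on the first platform can write the same call on the second and get the same result, even though the code doing the work differs.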
The Supreme Court held that Google’s copying of the thirty-seven packages’ declaring code was a transformative and fair use.124Id. at 1209. The Court eschewed a superficial framing of Google’s use as identical to Oracle’s—that in both Java SE and Android, the copied declaring code lets programmers invoke computing tasks.125Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1203 (2021). Such a brief inquiry would “severely limit the scope of fair use in the . . . context of computer programs” because “virtually any unauthorized use of a copyrighted computer program . . . would do the same.”126Id. Google’s copying of some of the API’s declaring code was held to be transformative because it served their purpose of creating and “expand[ing] the use and usefulness of” a new product that “offer[ed] programmers a highly creative and innovative tool.”127Id. Oracle is relevant to Copilot because it dealt directly with copying code as a fair use, and involved the creation of a tool for programmers.
The creators of Codex, the model behind Copilot, have argued that copying training data as an intermediate step in the creation of an AI or machine learning (ML) system is a fair use of the training examples.128See OpenAI comment, supra note 23, at 4–10. GitHub feels similarly. Copilot’s product page says that “[t]raining machine learning models on publicly available data is considered fair use across the machine learning community.”129Copilot product page, supra note 28. Nat Friedman, GitHub’s CEO at Copilot’s unveiling, stated in an online discussion that “training ML systems on public data is fair use.”130Nat Friedman, HACKER NEWS (Jun. 29, 2021), https://news.ycombinator.com/item?id=27678354 (last visited Dec. 12, 2021). Whether they are correct remains to be seen, but an assessment of whether creating Copilot is a fair use requires a trip through the four fair use factors.
2. Copilot and Fair Use
a. Factor One: The Purpose and Character of the Use
As described above, factor one is dominated by transformativeness.131Supra, § III.A.1. Is Copilot a transformative use of the training examples? Does it add something new and important to those works, or does it merely supersede them? The case law demonstrates that taking exact copies of the entirety of many works does not preclude a finding of transformativeness or fair use. Hence, GitHub’s prima facie infringement may not lead to copyright liability. Purpose-transformativeness provides a useful category into which Copilot may fit. GitHub copied and used the copyrighted training examples to build something new and different, namely a generative AI system. Copilot seems to fall into the new creation flavor of purpose-transformativeness, because it would not exist without use of the training examples and it is put to a different purpose than the training examples. The training examples are meant to be employed as part of a computer program, whereas Copilot is a tool that suggests segments of code to programmers as they work. In addition, the dissemination of useful tools can further copyright’s overall goal of promoting science and the useful arts. In Oracle, the Supreme Court squarely stated that copying to create a new tool for programmers is “consistent with that creative ‘progress’ that is the basic constitutional objective of copyright itself.”132Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1191 (2021) (citing Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 349–50 (1991)).
This brings us to a crucial question in any assessment of the purpose and character of Copilot: what is the scope of the inquiry? A court will likely include Copilot’s output in its analysis. The cases discussed above considered how the defendant’s product was used, and what it provided to its users. The Ninth Circuit in Kelly did not assess the purpose and character of a search engine generally. Instead, it surveyed the features and character of the disputed image search engine, and decided that the copyrighted photographs were being used for a different purpose than aesthetic enjoyment.133Kelly v. Arriba Soft Corp., 336 F.3d 811, 818 (9th Cir. 2003). The Second Circuit in Authors Guild inspected the output of the search and snippet functions of Google Books. Similarly, a court will probably include how Copilot is used in its analysis of the purpose and character of GitHub’s use. This might cut against GitHub, because Copilot’s ultimate purpose is to output content of the same type as the training examples: source code. In a sense, Copilot is a very complex way to do something not at all transformative, particularly when it suggests a verbatim copy from its training data. It copies source code to create an elaborate device that ultimately outputs source code. This is different from Google Books and the image search engines, which constructed their own additive uses distinct from the copied works’ primary use of accessing their expression. Transformative purposes are typically orthogonal to the purpose of the copied works. Including Copilot’s output in the assessment of its transformativeness makes its purpose more parallel than orthogonal.
On the other hand, Copilot can be viewed as furthering copyright’s overarching goals because it helps programmers create new works. Lest all copying of code be denied the defense of fair use, the Supreme Court in Oracle admonished courts not to cast the purpose and character of a later use as identical to the first in that they are both computer programs.134Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1203 (2021). Here too, a court may consider myopic the view that Copilot is an elaborate device that turns code into code. A wider view recognizes that Copilot does much more than reproduce the original works: it aims to increase programmer productivity by suggesting small amounts of code that implement the programmer’s current goal; it may make programming more enjoyable by obviating the need to write simple code and refer to documentation; and it can expose programmers to new services and programming methods.135Video and Source Code of Copilot Use, GITHUB, https://github.com/github/copilot-docs/tree/main/gallery/python-sentiment-analysis-of-text (last visited Dec. 13, 2021) (demonstrating using Copilot to discover an API to perform sentiment analysis of text). Facts like these helped Google reach a factor one result in favor of fair use in Oracle, where the Court credited that Google’s use to create Android gave programmers an innovative tool and could advance computer programming.136Oracle, 141 S. Ct. at 1203–04. GitHub can credibly assert that Copilot does the same thing by allowing programmers to focus on the more complex and creative code in their programs, and avoid the humdrum tasks whose implementation the tool can predict. By increasing the productivity of programmers, Copilot has a claim to giving them a similarly innovative tool that facilitates the creation of new programs, furthering copyright’s interest in the creation of new works.
An important unresolved question is the degree to which Copilot can be said to use, be interested in, or internally represent the expressive elements of its training data. Insofar as Copilot uses the expressive elements of a training example, its purpose, and its use, are less transformative. A system like Copilot which outputs expression would seem to use the expressive aspects of its training data to a greater degree than a system that does something expression-neutral like recognize the face in a photograph. This appears to be a battle of framing. The words used to describe the training process and the system’s functionality may impute those features with more or less use of expression in the training examples. And the words used are always going to be lacking where they are attempting to fit human language onto a large-scale mathematical operation. Under the hood, the processing of a single example will slightly alter the model’s parameters. Is Copilot analyzing the training data to gather metadata about the code? Is it extracting unprotectable information about the code, like recording the presence of a term in Google Books? Is it translating and representing the expression in the code within the model? The fact that a tool like Copilot is essentially an application of mathematics leans in favor of its use being understood as analysis, where the output being expression is more an illusion than a reflection of true understanding or use of expression in the training data. This question will need addressing by any future litigation based on Copilot or another similar generative system.
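The observation that processing a single example slightly alters a model’s parameters can be made concrete with a toy sketch. This is not Codex’s or Copilot’s actual training procedure, which is far larger and more complex. The one-parameter model, the squared-error loss, and every name below are assumptions chosen only to show the kind of arithmetic involved.

```java
// A toy illustration, not Codex's or Copilot's actual training procedure.
// ToyTrainingStep and step are hypothetical names.
class ToyTrainingStep {
    // One gradient-descent step for a one-parameter model whose prediction is w * x.
    // The loss for a single example (x, y) is (w * x - y)^2,
    // so its gradient with respect to w is 2 * (w * x - y) * x.
    static double step(double w, double x, double y, double learningRate) {
        double gradient = 2.0 * (w * x - y) * x;
        // The parameter is nudged slightly in the direction that reduces the loss.
        return w - learningRate * gradient;
    }
}
```

After the step, the example itself is not stored anywhere in the toy model. Only the adjusted number w remains. Whether that adjustment is better described as analysis of the example or as a representation of its expression is the framing question raised above.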
OpenAI, the creator of the model that powers Copilot, argues that training AI systems is transformative.137See OpenAI comment, supra note 23, at 5. It argues that training generative AIs has nothing to do with the expression in the training examples, but rather helps the system “learn the patterns inherent in human-generated media.”138Id. This view aligns with the idea that any AI system’s ostensible expressive understanding or ability is an emergent illusory phenomenon rooted in mathematics. The company also argues that the outputs of a generative AI system are transformative because one cannot consume a training example (that work’s ordinary, expressive use) by looking at the system or its outputs.139Id. But this view of the use of a given work is too specific. When courts assess whether a defendant’s purpose in using a work is orthogonal to the work’s purpose, they speak of categories. Two different action films have the same ordinary purpose: viewing to access the expression in the film. The purpose of a particular work is not to consume that particular work. If that were true, any new work that changed an existing work would be transformative by virtue of having a different identity. Instead, courts ask whether the purposes are different in kind. A defendant’s purpose may not be readily credited as transformative where it is of the same kind or within the same category as the copied work, absent another transformative use like commentary or criticism.140E.g. Andy Warhol Found. for Visual Arts, Inc. v. Goldsmith, 11 F.4th 26, 37–43 (2d Cir. 2021).
Copilot’s transformative status could depend on where a court places emphasis. The product itself can lay claim to a transformative purpose in that it created a new and innovative product that can reduce friction in the subsequent creation of new computer programs. If emphasis is placed on that, this factor will likely tilt in GitHub’s favor. But the fact that its output is a work of the same type as its training data causes issues, because typical transformative purposes involve a defendant who uses works to generate a distinct or orthogonal result. Technically sophisticated or not, Copilot produces code, and that may preclude a finding of transformativeness if that is where a court focuses.
b. Factors Two and Three
Factor two, the nature of the copyrighted work, acknowledges that copyrighted works lie on a continuum whose poles correspond with wider and narrower scopes of fair use.141See Harper & Row Publishers, Inc. v. Nation Enterprises, 471 U.S. 539, 563 (1985) (“The law generally recognizes a greater need to disseminate factual works than works of fiction or fantasy.”) (citing Gorman, Fact or Fancy? The Implications for Copyright, 29 J. COPYRIGHT SOC. 560, 561 (1982)). More use of a work that is factual or functional is permitted than of a work that is more creative and expressive.142See Nimmer, 13.05[A][a]. While this factor is relatively unimportant,143See Authors Guild v. Google, Inc., 804 F.3d 202, 220 (2d Cir. 2015) (stating that factor two “has rarely played a significant role in the determination of a fair use dispute.”) (citation omitted). it likely weighs in GitHub’s favor because the training examples are source code. Code is more functional than other types of copyrightable works, and copyright allows more use of it than works like novels or films. In Oracle, the Court demonstrated that the nature of the copied work can be influential in fair use decisions by promoting this factor to the fore of its discussion.144Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1201 (2021). While there, as here, the copied work was code, application of the reasoning in Oracle can be slightly discounted because the code copied in that case was declaring code. The Court came close to outright declaring that declaring code is unprotectable because it is an interface, and is thus “inherently bound together with uncopyrightable ideas” like the choices made in dividing and organizing all possible tasks into packages, classes, and functions.145Id. at 1202. Copilot uses the entirety of its training examples, including their implementing code, which is closer to the “core of copyright.”146Id. 
Still, the idea-expression dichotomy puts code on weak copyright footing at the outset, and that will likely be reflected in this factor.
The nature of the training examples may also harken back to and influence Copilot’s transformativeness. As discussed above, the extent to which Copilot is accessing or using the protectable expressive elements of the training code is an unresolved question. But where the output generated by AI is more functional, like code, it stands to reason that the system is more interested in learning correctness from the training data rather than expressive elements. This would be like trying to learn grammar from human-language text, or accessing the unprotectable functional elements in Sega.147See supra, § III.A.1. Thus, in a fair use analysis of Copilot, code might allow more copying than other types of works.
The third fair use factor courts must consider is “the amount and substantiality of the portion used in relation to the copyrighted work as a whole.”14817 U.S.C. § 107(3). It calls for an assessment of the quantitative and qualitative amount of the original work used by the new work. The quantitative assessment is mostly straightforward and entails counting how much of the original work the new work uses; does the new work use one sentence out of a novel, or an entire chapter? The qualitative assessment asks which parts of the original work are used by the new work; does the new work use an unimportant portion, or does it use the “‘heart’” of the work?149Harper & Row Publishers, Inc. v. Nation Enterprises, 471 U.S. 539, 564–65 (1985). As the factor’s language specifies, this inquiry is into the amount of the original work that was used, and is not supposed to look at how much of the new work is made up of the original work.
In general, the smaller the amounts taken from the plaintiff’s work, the more likely the new use is fair (and vice versa). Copilot uses the entirety of its training examples. However, “the extent of permissible copying varies with the purpose and character of the use.”150Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 586–87 (1994). If the defendant’s transformative purpose requires that more of the work be used, then such an amount and substantiality will not necessarily weigh against fair use.151See Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1205 (2021) (“The ‘substantiality’ factor will generally weigh in favor of fair use where . . . the amount of copying was tethered to a valid, and transformative, purpose.”). The result for this factor is thus a function of the result in the first factor. If a court credits Copilot’s use as transformative, then this factor will likely tilt in favor of fair use for two reasons. First, ML and AI work best when they are trained on large amounts of data.152See Sobel, supra note 92, at 58. Thus, using the entirety of the work is tethered to the purpose of creating an effective tool to suggest code to programmers. Second, Copilot probably needs the entirety of the training examples to extract usable information about how code is written. For one thing, source code often references itself. A function invoked may be declared and implemented elsewhere in the same file. If only the definition or only the invocation is used in training Copilot, then Copilot lacks accurate awareness of what code actually looks like. If Copilot is not credited as a transformative use, then this factor will presumably weigh against GitHub because they took the entire work.
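The point about self-referencing source code can be illustrated with a short file in which one function invokes another that is defined further down. The class and method names are hypothetical and chosen only for illustration.

```java
// Hypothetical example; the class and method names are illustrative only.
class Greeter {
    // Invocation: this call is only meaningful alongside the definition of buildGreeting below.
    static String greet(String name) {
        return buildGreeting(name) + "!";
    }

    // Definition: declared and implemented later in the same file.
    static String buildGreeting(String name) {
        return "Hello, " + name;
    }
}
```

A training process that saw only greet would encounter a call to a function it never saw defined, and one that saw only buildGreeting would never see how the function is used. Taking the whole file preserves that relationship.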
c. Factor Four: Market Effects
The fourth and final fair use factor is “the effect of the use upon the potential market for or value of the copyrighted work.”15317 U.S.C. § 107(4). Courts consider the extent of harm to the market for the copied work and the work’s derivatives caused by the defendant’s use.154See Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 590 (1994). Also reviewed is the market impact of many others using the work as has the defendant, a potential result of a fair use holding.155See id.; Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1207 (2021). Only certain types of harms are counted against a fair use finding, however, namely harms of substitution or usurpation where the defendant’s work can substitute for the plaintiff’s in the marketplace.156Patry on Copyright, § 10:150. Thus, this factor is harmonious with the first, because the more transformative a use, the less likely its output can act as a substitute.
The market effects inquiry is normally an outgrowth of copyright’s incentive model: that the optimal way to induce the creation of valuable works is to give authors exclusive rights over use of those works from which they can extract value.157See Rebecca Tushnet, Economies of Desire: Fair Use and Marketplace Assumptions, 51 WM. & MARY L. REV. 513, 517 (2009). By siphoning off some of the value from the work’s exploitation and reducing the value that flows to the original author, substitutive market harms “conflict with copyright’s basic objective.”158Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1206 (2021). Analysis of Copilot under this factor is seemingly unusual because the training examples are computer programs made publicly available. Per the liability scenario motivating this discussion, a legally problematic training example is likely made available under an open source license that forgoes money in favor of other considerations.159Supra, § II.B. That theory of liability depends on an understanding that an open source author extracts “substantial . . . economic benefits” from their public offering of their code.160Jacobsen v. Katzer, 535 F.3d 1373, 1379 (Fed. Cir. 2008). Open source authors may use a publicly available component to generate leads for a broader, closed source project; improve their reputation in the industry; or obtain fast and free project improvements from the community.161Id. In addition, GitHub itself allows users to contribute money to open source authors. See Devon Zuegel, Announcing GitHub Sponsors, THE GITHUB BLOG (May 23, 2019), https://github.blog/2019-05-23-announcing-github-sponsors-a-new-way-to-contribute-to-open-source/ (last visited Dec. 13, 2021); GitHub Sponsors, GITHUB, https://github.com/sponsors (last visited Dec. 13, 2021). Although not a typical transaction, these considerations represent value extracted from an author’s work, and substitution still interferes with that aspect of copyright’s incentive model. 
In effect, the license-violation theory of copyright liability fits neatly with factor four’s economic focus. Analysis of Copilot under this factor should thus ask the conventional question of whether its use of the training examples leads to market substitution that interferes with value extraction by the examples’ authors.
Copilot itself does not represent a substitutive harm, unless a training example is taken from a program that also suggests code to programmers as they work. The author of such a training example has a simple argument that GitHub used their work to create a market substitute. Very few training examples will be in this group; the vast majority will perform functions completely different from Copilot. In that case, Copilot itself does not present a harm of substitution.
More complicated is how a court will consider Copilot’s later use by programmers. The optics of GitHub’s use are not great. GitHub took advantage of the public availability of open source by using it to build a product that can help a later programmer write their own program rather than use the original author’s work. But this overlooks nuance and extends GitHub’s responsibility too far. For one program to serve the same market as another program, it must overlap in ultimate functionality. Thus, use of Copilot would only present a cognizable factor four injury to an example’s author if the user is writing a program with functionality similar to the example. However, a fair use is not rendered unfair simply because the defendant is wading into the same competitive waters as the plaintiff.162See Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510, 1523 (9th Cir. 1992), as amended (Jan. 6, 1993). The problematic substitution story is this: rather than use an open source author’s project, a programmer will decide to write their own program that does the same thing, and will use Copilot while doing so. However, while courts will review Copilot’s output, they will not impute liability to GitHub for what users do with that output. In the substitution scenario, the substitution comes from the user deciding to write their own version, not Copilot’s use of a training example. Copilot does not give the user a substitute because it does not output entire nontrivial programs.163See infra, § III.B.1. It attempts to glean the user’s short-term programming goal and suggests small amounts of code to accomplish that goal.164Id. It rarely shows verbatim copies, and probably cannot be used to access substantial amounts of any training example.165Id. See Authors Guild v. Google, Inc., 804 F.3d 202, 224 (2d Cir. 2015) (finding that the Google Books snippet view was not a market substitute for a book because the snippets were short and made available in a “cumbersome, disjointed, and incomplete” fashion).
The trouble in assessing Copilot’s substitutive market effect stems from its nature as a tool. A program that presents a cognizable copyright harm to a training example author is a blend of contributions from the user and Copilot. Typically, factor four looks at the defendant’s use itself or that use’s outputs and asks whether they represent copyright harms in and of themselves. Here, since Copilot’s outputs are unlikely to be substitutes on their own, the question remaining is whether GitHub should be held responsible for choices made by its users. But that is better understood as secondary copyright liability, and is discussed separately later.166See infra, § III.C. Copilot can be understood as an instrument of programmers, like the text editor in which they program or the debugging tools in their IDE. Another view sees Copilot poisoning every program containing its suggestions, because the author has exploited the training examples to “avoid the drudgery in working up something fresh.”167Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 580 (1994). This latter view may feel righteous, but it expands GitHub’s liability too widely. The company may slip the knot of substitution harms because the user is the source of the choice to author a substitute.
Still, while it is difficult to lay the substitution at Copilot’s feet, the tool being used to achieve the substitution was built by using the training examples. Generative AI applications like Copilot present grander threats to the potential market for human authorship than do typical uses.168See Sobel, supra note 92, at 57. These applications may obsolesce human authors and “replac[e] them with cheaper, more efficient automata.”169Id. But analysis of market effects should exclude speculative harms. In Oracle, the Court found that Android did not usurp any market for Java SE, because Oracle’s prospects for success in the mobile phone market were dim.170Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1206 (2021). A technology company like Oracle successfully entering the mobile phone market seems more likely than a mass replacement of humans by machines in writing computer programs, at least in the reasonably foreseeable future.
As a final point, the licensing market for the training examples should be considered. Within factor four, courts entertain the loss of licensing revenues that a plaintiff may have suffered by the defendant’s unauthorized use of the plaintiff’s work. This is limited to “traditional, reasonable, or likely to be developed [licensing] markets.”171Am. Geophysical Union v. Texaco Inc., 60 F.3d 913, 929–930 (2d Cir. 1994). Although it is novel, there is a widespread market for training data: the exchange of free services provided by technology companies for the rights to user-generated content.172See Sobel, supra note 92, at 77. It is reasonable to envision a world where users are entitled to compensation for use of their works in training ML models. If that market is considered in the fair use analysis, it could count against GitHub’s use of the training examples.
B. User's Direct Infringement
A Copilot user may directly infringe copyright by using the tool. As mentioned above, in rare cases Copilot will suggest verbatim code from its training data. For argument’s sake, assume Copilot suggests some verbatim code licensed under a copyleft license. If the user accepts the suggestion, they have made a copy of a portion of the original work.173Even if the suggestion is not accepted, its appearance on the screen means that a copy exists in the computer’s memory. The user may then release their computer program without adhering to the terms of the original work’s license, potentially exposing them to liability for infringing the original author’s reproduction right.174See Jacobsen v. Katzer, 535 F.3d 1373, 1382–83 (Fed. Cir. 2008). This direct infringement scenario is unlikely, because GitHub’s research indicates that only 0.1% of suggestions are verbatim copies, and that most such outputs may not pose copyright issues.175Copilot Product Page, supra note 28; Albert Ziegler, GitHub Copilot Research Recitation, GITHUB DOCS, https://docs.github.com/en/github/copilot/research-recitation (last visited Dec. 12, 2021) [hereinafter Copilot research recitation] (reviewing verbatim suggestions above a threshold length and finding that they included: long repetitive sequences; “standard inventories” like stock market tickers and the Greek alphabet; code found in more than ten training examples). However, the high statutory damages awarded per infringed work render even small numbers of infringements problematic.
A lack of intent by the user to copy the training example does not preclude liability for infringement. “[T]he general proposition is that innocent intent is no defense to copyright infringement.”1764 Nimmer on Copyright § 13.08 (2021). Even unconscious copying carries with it the specter of infringement liability.177See e.g. Bright Tunes Music Corp. v. Harrisongs Music, Ltd., 420 F. Supp. 177 (S.D.N.Y. 1976), aff'd sub nom. ABKCO Music, Inc. v. Harrisongs Music, Ltd., 722 F.2d 988 (2d Cir. 1983). Independent creation is a defense to infringement,178Actual copying in fact of the allegedly infringed work is a required element of infringement actions, so independent creation without copying precludes liability. See Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 361 (1991). but the acceptance of a Copilot suggestion is not independent creation. The user may argue that GitHub is the actual instigator of the copying, and the company should accordingly be held the direct infringer in lieu of the unwitting user. But in cases where a system that programmatically makes copies is invoked and directed by a user, direct infringement typically rests with the “volitional” actor using the system.179See Religious Tech. Ctr. v. Netcom On-Line Commc'n Servs., Inc., 907 F. Supp. 1361, 1370 (N.D. Cal. 1995). The Copilot user is likely to be considered that volitional actor. In one illustrative case, the Second Circuit assessed whether a cable company directly infringed copyright by operating a service wherein users could record television programs with a single button press on a handheld remote.180Cartoon Network LP, LLLP v. CSC Holdings, Inc., 536 F.3d 121, 124 (2d Cir. 2008). Copies of recorded programs were stored on the cable company’s servers.181Id. 
The court focused its direct infringement inquiry on “the volitional conduct that causes the copy to be made,” and found such conduct in the user who “issu[es] a command directly to a system, which automatically obeys commands and engages in no volitional conduct.”182Id. at 131. The notion flowing from cases like this is that where a user instructs or causes a system to make infringing copies, the user is the direct infringer. The Copilot user likely supplies enough volition to be considered the direct infringer for a copy made by accepting suggested code. The button-pressing user in Cartoon Network participated in the copying process much less than a Copilot user, who must write some source code to trigger the suggestion feature and then accept the suggestion. Because the Cartoon Network user was found to be the infringer, it follows that the Copilot user will, too.
In spite of the prima facie direct infringement by the Copilot user, liability may not follow for a few reasons. First, the copied code may not even be protectable, given copyright doctrines like merger and scènes à faire. Second, the amount of code copied by Copilot may be so small as to be considered legally de minimis. Third, the user can argue their use of the code was fair.
1. User's Substantial Similarity Argument
A copying Copilot user may argue that they have not infringed at all. Copyright infringement plaintiffs must show that they own a valid copyright, and that the defendant copied original elements of the plaintiff’s work.183Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 361 (1991). The latter requirement is made up of two inquiries: whether the defendant actually copied the plaintiff’s work (copying in fact), and if so, whether such copying is actionable – that is, whether the defendant’s work is substantially similar to the plaintiff’s work.184See 4 Nimmer on Copyright § 13.01[B] (2021). Not all copying is unlawful. Here, copying in fact is satisfied by the original code’s inclusion in Copilot’s training data, followed by the user’s later acceptance of the copied suggestion. Next, we must assess whether such copying is extensive enough to render the training example and the user’s program substantially similar. This judgment involves arbitrary line drawing, and circuits differ in their approaches. In general, courts scrutinize the copying through both qualitative and quantitative lenses, taking both the original work and the allegedly infringing work into account.
Copilot mostly suggests small amounts of code.185See Copilot product page, supra note 27 (displaying examples of roughly fifteen lines of code); Video of Example Use of Copilot and Resulting Code from GitHub, GITHUB, https://github.com/github/copilot-docs/tree/main/gallery/python-discovering-emails-in-screenshot (last visited Dec. 12, 2021) (depicting similarly sized code suggestions when using Copilot). Even if the technology advances, this will not likely change because of how the user and Copilot interact. The tool cannot read the user’s mind. It waits for an indication of the user’s current programming goal, and then suggests code that it predicts will achieve that goal. Evidence of the user’s goal can be found, for example, in the current line of code being worked on, a comment explaining what the code about to be written below it will do, or a descriptive function name. One can imagine a user evincing a very broad goal, like a comment stating that a file contains an entire spreadsheet program, but the number of possible implementations between that high-level purpose and the actual source code renders use of Copilot impractical. A practical goal that Copilot can use is necessarily limited in scope. As discussed above, writing code often takes the form of an organizational task wherein the overarching purpose is subdivided into various forms like classes and functions. This process provides small and usable goals to Copilot as the user successively writes files, classes, and functions. Such a usage pattern bounds the size of suggestions since the user’s goal only becomes sufficiently clear when a relatively small amount of code is being (or about to be) written.
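To make the cues described above concrete, the following Python sketch shows a comment and a descriptive function name of the sort a user might write, followed by a short body of roughly the size a suggestion might fill in. The function name, the comment, and the body are all hypothetical illustrations chosen for this example; none of it is actual Copilot output.

```python
from datetime import date

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


# Cue 1: a comment explaining what the code written below it will do.
# Return the name of the weekday for a date given as "YYYY-MM-DD".
# Cue 2: a descriptive function name and parameter list.
def weekday_name_from_iso_date(date_text: str) -> str:
    # Hypothetical suggestion: a few lines that accomplish the narrow goal
    # stated by the comment and the function name above.
    parsed = date.fromisoformat(date_text)
    return WEEKDAY_NAMES[parsed.weekday()]
```

The goal expressed here is narrow enough that only a handful of lines are needed to satisfy it, which is the sense in which the usage pattern bounds the size of a suggestion.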
Recall that section 102(b) excludes ideas, procedures, and processes from copyright protection. Putting this into action when evaluating whether an act of copying is unlawful requires ensuring that unprotectable aspects of the allegedly infringed work are excluded from the ultimate comparison between it and the allegedly infringing work. In determining whether a Copilot user’s computer program is substantially similar to the training example program, only protectable expression found in the training example is to be considered. Generally, distilling a computer program to its protectable elements and assessing substantial similarity is accomplished by the widely used Altai test.186See Computer Assocs. Int'l, Inc. v. Altai, Inc., 982 F.2d 693 (2d Cir. 1992). The test proceeds in three steps: specifying the copied part of the plaintiff’s program at various levels of abstraction, ranging from the source code to the program’s overall function; at each level, filtering out unprotectable elements; finally, comparing the remaining protectable expression with the defendant’s program.187Id. at 706. This test “breaks no new ground,” but instead applies existing copyright doctrines to computer programs;188Id. perhaps code requires a particularized explanation due to most people’s unfamiliarity with the subject. A copying Copilot user can resist substantial similarity by arguing that some or all of the copied code should be filtered out by various copyright doctrines in the second step of the Altai test. After these attacks, there may be no remaining protectable expression in the copied code, or so little as to not be actionable.
As a preliminary matter, breaking down small segments of code into levels of abstraction as dictated by step one of the Altai test is a quick matter. A small amount of code will have a narrower range of possible abstractions than an entire computer program. Both have an ultimate purpose and source code, but a small amount of code may have only one intermediate abstraction, such as a high-level description of the steps it is performing.189For example, an abstraction of a small function might be a series of high-level steps: validate the function input, fetch data from some source, check the fetched data, and return the data formatted in a particular manner. This contrasts with larger amounts of code, which can be described by higher-level interactions of modules of code, like the interactions between the user-facing website code of a service like Facebook and the system that stores Facebook user data. The highest level of abstraction, the purpose or goal of the code, is unprotectable as an idea.19017 U.S.C. § 102(b). Intermediate abstractions of a small amount of code are likely to be unprotectable ideas as well, because there is insufficient space between the poles of abstraction to include protectable aspects of computer programs like module design, parameter lists, and the flow of data between modules.191See e.g. M-I L.L.C. v. Q'Max Sols., Inc., No. CV H-18-1099, 2020 WL 4549210 (S.D. Tex. Aug. 6, 2020) (holding that algorithm consisting of four sub-steps is unprotectable as a process). Accordingly, a substantial similarity analysis where the alleged infringer copied a small amount of code likely collapses to an analysis of the source code alone.
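The single intermediate abstraction described in the footnote above can be sketched in Python. This is an illustration only: the function name, the dictionary standing in for a data source, and the output format are assumptions made for the example. The docstring states the highest level of abstraction (the purpose), the four comments state the intermediate level (the high-level steps), and the statements themselves are the lowest level (the source code).

```python
def format_user_record(user_id, records):
    """Purpose (highest abstraction): produce a display line for one user."""
    # Step 1: validate the function input.
    if not isinstance(user_id, int) or user_id < 0:
        raise ValueError("user_id must be a non-negative integer")
    # Step 2: fetch data from some source (here, a plain dictionary).
    record = records.get(user_id)
    # Step 3: check the fetched data.
    if record is None or "name" not in record:
        raise KeyError(f"no usable record for user {user_id}")
    # Step 4: return the data formatted in a particular manner.
    return f"{record['name']} (id {user_id})"
```

With only one layer of description between the purpose and the code, there is little room for the structural features, such as module design or data flow between modules, that can carry protection in larger programs.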
The user may argue that copyright’s merger doctrine removes some or all of the copied code from protection. Merger doctrine says that if an idea is only expressible in one or a limited number of ways, then expression of the idea is denied copyright protection to avoid the consequence of violating the idea-expression dichotomy.1924 Nimmer on Copyright § 13.03[B][a] (2021). The doctrine was explicitly applied to computer programs by the Altai court and characterized as excluding from protection those elements of a program dictated by efficiency.193Computer Assocs. Int'l, Inc. v. Altai, Inc., 982 F.2d 693, 708 (2d Cir. 1992). This view of merger flows from the fact that efficiency, both in terms of time and hardware resource use, is a goal of computer programs generally and benefits program users. While there will technically be many ways to write a piece of code, the desire for efficiency can reduce the number of possible choices significantly. This reduction in possible expressions implicates merger doctrine and may block protection for some expression in the code, lest copyright lock up the only efficient way to implement an unprotectable function.
Application of merger to small amounts of code is particularly fitting because their small size limits the choices for expression even before efficiency considerations reduce creative choice further. At the extreme, a single line of code accomplishing some unprotectable idea may have only one sensible implementation. There are very few ways to sum two numbers, invoke a particular function, or assign a value to a variable. As the volume of code grows, the number of possible expressions of the code’s ultimate function increases in tandem; eventually, some threshold is crossed and a programmer’s code breaks free from merger and becomes protectable expression. But insofar as Copilot only suggests small segments of code, merger weighs upon the protectability of its suggestions.194In rare cases a small amount of code can be extremely creative, such as the head-scratching short algorithm for computing the inverse of the square root of a number used in a video game whose source code was eventually released. Fast Inverse Square Root, WIKIPEDIA, https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code (last visited Dec. 12, 2021).
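A brief Python sketch may help illustrate the point. The names below are hypothetical and chosen only for this example. The first group shows single-line tasks for which there is essentially one sensible expression; the second shows that even a slightly larger task, computing an average, already admits more than one.

```python
# Single-line tasks: almost no room for expressive choice.
def add(a, b):
    return a + b              # summing two numbers

total = add(2, 3)             # invoking a function and assigning the result


# A slightly larger task: two different expressions of the same idea.
def mean_with_loop(values):
    running_total = 0
    for value in values:
        running_total += value
    return running_total / len(values)


def mean_with_builtin(values):
    return sum(values) / len(values)
```

Both averaging functions return the same result for the same input, yet they are written differently. That widening range of choices as code grows is what eventually moves a programmer’s code out from under merger.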
The protectable elements of code shrink further insofar as code’s expression is influenced by factors external to the author’s creativity. Altai brought such factors under the umbrella of scènes à faire,195Altai, 982 F.2d at 709–10. a copyright doctrine that permits the copying of stock elements or standard techniques within a particular type or genre of work.1964 Nimmer on Copyright § 13.03[b] (2021). In computer programs, this is reflected in constraints imposed by things like interoperability with other systems, the hardware on which the program will run, and industry demand. For example, if interoperability with a separate system requires the inclusion of a particular piece of code, then every program written to be compatible with that separate system will necessarily include that code. Copying that code would not be infringement. The Altai court applied scènes à faire and declared unprotectable elements of the allegedly infringed work that were “obvious,” because they “follow naturally from the work's theme rather than from the author's creativity.”197Altai, 982 F.2d at 715 (quoting 3 Nimmer on Copyright § 13.03 [F] (1991)). If the “theme” of a small amount of code is the limited functionality that the code implements, then the code’s expression might similarly owe its origin to the functionality rather than the author. However, if a more expansive genre or theme is considered, like the program’s ultimate function, then a small piece of code is much less likely to be considered stock or standard.
Style conventions for a given programming language also influence source code. For example, one style guide for the programming language C++ tells programmers not to declare a variable before actually using it, to avoid variable names of entirely capitalized letters, and to write functions that perform a “single logical operation.”198C++ Core Guidelines, GITHUB, https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md (last visited Dec. 12, 2021). Open source code authors like the Copilot training example authors have an incentive to conform to a popular style convention. Public release of their code provides value by enhancing their reputation and improving code quality by inviting and accepting contributions from the community.199See Jacobsen v. Katzer, 535 F.3d 1373, 1379 (Fed. Cir. 2008). Adhering to style conventions makes realization of that value more likely by reducing friction for others engaging with the project. For a copied Copilot suggestion, elements flowing from style conventions may be filtered out from the substantial similarity comparison because they are not original to the training example’s author.
What protectable elements are left after application of these various copyright doctrines? Each potentially removes aspects of the code from the similarity comparison. As discussed previously, this filtering process is being performed on a small amount of code because of Copilot’s usage pattern. Copying any remaining protectable elements is likely to be legally inert under copyright doctrine because it is de minimis. Ultimately, when a Copilot user copies a relatively small amount of code by accepting a suggestion that is a verbatim (or nearly verbatim) copy from Copilot’s training data, it is unlikely that the original and later works are substantially similar.
Courts have assessed de minimis copying in the software context. A given act of copying might avoid a substantial similarity finding because of the copied portion’s qualitative or quantitative insignificance to the plaintiff’s overall program. In one case, a defendant’s copying of eight out of sixty-four data field names from a database was found not to constitute substantial copying.200InfoDeli, LLC v. W. Robidoux, Inc., No. 4:15-CV-00364-BCW, 2017 WL 11517565 (W.D. Mo. Dec. 18, 2017). See also Digital Drilling Data Sys., L.L.C. v. Petrolink Servs., Inc., 965 F.3d 365, 375 (5th Cir. 2020) (upholding summary judgment for defendant on the basis that no reasonable jury could find substantial similarity based on defendant copying five percent of plaintiff’s database schema). Another defendant was granted summary judgment when they copied forty-four lines of code from the plaintiff’s program of two million, and the copied lines were not shown to be qualitatively important.201M-I L.L.C. v. Q'Max Sols., Inc., No. CV H-18-1099, 2020 WL 4549210 at 9 (S.D. Tex. Aug. 6, 2020). Copying four protectable features of a wood trussing program was held not to warrant a finding of substantial similarity because the plaintiff had not shown “the significance of the copied features.”202MiTek Holdings, Inc. v. Arce Eng'g Co., 89 F.3d 1548, 1560 (11th Cir. 1996). In contrast, a court rejected a de minimis argument where the programs at issue consisted of hundreds of thousands of lines of code, a quantitatively small 437 lines were copied, but the lines “perform[ed] a real function.”203Mktg. Tech. Sols., Inc. v. Medizine LLC, No. 09 CIV. 8122 (LMM), 2010 WL 2034404 at 3 (S.D.N.Y. May 18, 2010).
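The quantitative side of these holdings reduces to simple arithmetic, sketched below in Python using only the figures reported above. The helper name is hypothetical. The 437-line case is omitted because the text gives the size of those programs only as “hundreds of thousands” of lines, and the Digital Drilling figure is already stated as five percent.

```python
def copied_share_percent(copied, total):
    """Share of the plaintiff's work that was copied, as a percentage."""
    return 100 * copied / total


# (description, units copied, units in the plaintiff's work), as reported above
reported = [
    ("data field names (InfoDeli)", 8, 64),
    ("lines of code (M-I v. Q'Max)", 44, 2_000_000),
]

shares = {label: copied_share_percent(copied, total)
          for label, copied, total in reported}
# shares["data field names (InfoDeli)"] is 12.5 (percent)
# shares["lines of code (M-I v. Q'Max)"] is about 0.0022 (percent)
```

Shares of 12.5 percent and roughly 0.0022 percent were both held insufficient, while a quantitatively small 437 lines was actionable where the lines “perform[ed] a real function.” The ratio is therefore only the quantitative lens; it does not decide the question without the qualitative one.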
It is likely that the protectable elements of small portions of code copied with Copilot will not suffice to justify a finding of substantial similarity. Short code segments like those suggested by Copilot are generally unimportant. Any qualitative significance they may possess is limited by their size; one simply cannot do very much with fifteen lines of code. Tautologically, quantitative significance of the copied code to the overall program of the training example will probably be lacking because the copied portion is likely to be small. Strange scenarios that are helpful to an infringement plaintiff may occur, such as Copilot suggesting a larger block of code, or the Copilot user’s ultimate program consisting primarily of the copied suggestion.204Violation of a copyleft license often requires distribution of the program to the public without providing the source code. Why would a Copilot user distribute a program consisting of only the Copilot-copied snippet? The simple answer is that they would not. But those are extremely unlikely. The most common scenario will be a Copilot user unwittingly copying a few lines of unimportant code into a much larger program. A finding of substantial similarity of the two programs will accordingly be unlikely, precluding any liability for copying via Copilot.
2. User's Defense: Fair Use
a. Factor One
Even if their substantial similarity argument fails, a user who directly infringes a training example by accepting a portion of the example’s code suggested by Copilot can still lean on the fair use defense. The first thing to assess is whether the user’s use of the copied segment of code is transformative. Recall that the prima facie infringement case presented involves the user copying a small amount of code verbatim from the training example because it was suggested by Copilot. The Supreme Court in Oracle indicated that in the context of computer programs, the purpose and character of copying is not identical to the original work just because both are meant to be executed by computers.205Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1203 (2021). It then held that copying declaring code from one program into another is transformative when it is done judiciously to create a new product for use by programmers.206Id. at 1203–04. The Copilot user is performing a less powerful version of what Google did in creating Android. Android was an entire new computing platform with which programmers would create many new programs in the form of smartphone applications. A user who copies a bit of code from a training example is creating their own new product, albeit one that is less likely to facilitate further expression as Android did.
Copying a small piece of one program and putting it to a new purpose (the new program’s functionality) ostensibly fits within the idea of transformative purpose. In one case, reproducing an entire copyrighted modeling photograph in a newspaper was held transformative because it turned the photograph into news and used it for a “further purpose.”207Nunez v. Caribbean Int'l News Corp., 235 F.3d 18, 24 (1st Cir. 2000) (quoting Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 579 (1994)). Copying code originally written as part of one program and employing it in service of an entirely different program is similar, and has a legitimate claim to transformativeness. If the second program has functionality similar to that of the first, this claim becomes more tenuous.
On the other hand, the Copilot user has to justify the copying.208See Authors Guild v. Google, Inc., 804 F.3d 202, 215 (2d Cir. 2015). Justification is impossible where the user was unaware they were accepting a verbatim copy of a training example.209GitHub has indicated they may add to Copilot a visual indicator when it is suggesting a verbatim copy. See Copilot Research Recitation, supra note 175 (“When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from.”). Even if they were aware, the small amount of copying is hard to justify for a reason other than the copyright-offensive goal of being convenient.210See Authors Guild v. Google, Inc., 804 F.3d 202, 215 (2d Cir. 2015) (“A secondary author is not necessarily at liberty to make wholesale takings of the original author's expression merely because of how well the original author's expression would convey the secondary author's different message.”); Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1203 (2021) (finding that Google copied only as was needed to ease Java programmers’ transition to Android). This method of copying may help a user’s factor one outcome, however, because of the user’s ignorance. Innocent infringement does not preclude prima facie liability, but fair use can take account of “the purpose and character” of the defendant’s use.21117 U.S.C. § 107(1). Here the copy was suggested by Copilot, where only 0.1% of suggestions are verbatim copies. It might be considered unreasonable to place upon the user responsibility for finding the source of a Copilot suggestion, not least because the user cannot distinguish between a merely unsuccessful search and a futile search for a suggestion that is not even a copy.
Copying small amounts of code into one’s own program is common practice and facilitates the creation of new programs. A popular online destination for programmers is Stack Overflow, a question-and-answer website where users ask and answer questions about code.212Stack Overflow Home Page, STACK OVERFLOW, https://stackoverflow.com/ (last visited Dec. 15, 2021). Users ask questions, like how to accomplish a task in a particular programming language, and other users give answers which yet more users vote on. Copying from Stack Overflow is so common that the company once ran an April Fools joke where they claimed to be selling a tiny keyboard that streamlined copying.213Ben Popper, Introducing the Key, STACK OVERFLOW BLOG (Mar. 31, 2021), https://stackoverflow.blog/2021/03/31/the-key-copy-paste/ (last visited Dec. 15, 2021). Behind this gag is data: “[o]ne out of every four users who visits a Stack Overflow question copies something within five minutes of hitting the page.” Ben Popper & David Gibson, How often do people actually copy and paste from Stack Overflow? Now we know., STACK OVERFLOW BLOG (Apr. 19, 2021), https://stackoverflow.blog/2021/04/19/how-often-do-people-actually-copy-and-paste-from-stack-overflow-now-we-know/ (last visited Dec. 15, 2021). Frequent sharing and copying of small amounts of code accomplishing discrete programming tasks is a common practice in the industry. That a programming practice is common was noted as a factor favoring fair use in Oracle when assessing the purpose and character of the use.214Oracle, 141 S. Ct. at 1204.
Copying small amounts of code, with or without Copilot, seems to both offend fair use as a mere convenience without justification and be excusable as common in the industry. If a copying user’s work performs a different function than the copied training example, the facilitation of a new work may result in a factor one finding in favor of fair use.
b. Factor Four, and the Rest of the Gang
A Copilot user’s own program will only present cognizable copyright harms to the program it copied from if both programs perform a similar function. While the copied code itself will technically perform the same function in both programs, the proper inquiry will be into the functionality of the entire programs. The author of the training example is not extracting value from a few lines of code of their project, but rather the whole work or some larger module of their work. If the allegedly infringing program is used for some completely different purpose, there is no danger of substitution and thus copyright’s incentive model remains undisturbed. In that case, this factor should weigh in favor of fair use.
Factor two, the nature of the copyrighted work, should help the Copilot user. The copied work is source code, a kind of work for which fair use permits more copying than for works closer to the core of copyright like paintings and books. Factor three, the amount and substantiality copied, is also dealt with quickly. In this scenario, the user is copying a very small amount of code, likely a small and unimportant part of the training example. Large programs can comprise millions of lines of code, and the value of code is largely an emergent phenomenon arising out of many discrete lines of source code. Fifteen lines from any program are unlikely to possess any qualitative importance.
Overall, the user’s fair use defense hinges on unknowns. In particular, their case is much weaker if the ultimate functionality of the user’s program overlaps with that of the copied program. In that event, the user’s claim to transformativeness would be weakened, and a risk of market substitution would present itself. A Copilot user risks such functional overlap whenever they accept a suggestion without knowing its source. The degree of sympathy a court may have for a user who was suggested copied code by Copilot and was unable to discover its source may also play a role.
C. GitHub's Secondary Liability
Device makers whose products are used to infringe copyright face secondary liability under copyright law. To hold a device maker secondarily liable, plaintiffs must first establish that there have been acts of direct infringement by the device’s users. Thus, the possibility of GitHub being found secondarily liable for Copilot’s use depends on the preliminary question of whether Copilot users are direct infringers with respect to the copyrighted training examples. The previous section explored that question in detail. Given the uncertainty of the answer, it is worth considering the question of secondary liability.
Sony v. Universal provides a defense to secondary liability, as well as guidance in assessing GitHub’s situation with Copilot.215Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417 (1984). Sony was selling VCRs which gave users the ability to make copies of television programming, some of which consisted of plaintiffs’ copyrighted works, onto reusable tapes.216Id. at 422, 434. The Court noted that the devices were put to various uses, some of which were unobjectionable or authorized by copyright owners.217Id. at 424. It ultimately declined to give the plaintiffs the right to enjoin the device altogether, holding that “the sale of copying equipment . . . does not constitute contributory infringement if the product is . . . capable of substantial noninfringing uses.”218Id. at 421, 424.
Sony established a safe harbor for device makers where the device is capable of substantial noninfringing uses. While there is no bright-line proportion of a device’s uses that satisfies this standard, Copilot likely satisfies it. GitHub has conveniently assessed how often Copilot suggests verbatim code. The vast majority of code that Copilot suggests is not found verbatim in its training data.219Ziegler, supra, note 192. In fact, only 0.1% of suggestions are verbatim copies.220Id. Some suggestions could be inexact but substantial copies of training examples, raising the share of ostensibly infringing uses above 0.1%. But some of those are unlikely to rise to infringement because they are small enough to avoid liability as de minimis copying.221See supra, § III.B.1. It seems unlikely that Copilot would suggest infringing, substantially similar copies frequently enough to push the infringing proportion of its uses so high as to preclude application of the Sony safe harbor. Thus, it is unlikely that GitHub would face secondary liability for Copilot.
D. Wrapping Up
Attempts have been made to find a through line among decisions like those outlined above. These seek to explain past decisions and prescribe a framework for deciding future cases in a manner that aligns with copyright’s goals. Sag begins with an axiom: that the touchstone of copyright is communication of an author’s expression to the public.222Matthew Sag, Copyright and Copy-Reliant Technology, 103 NW. U. L. REV. 1607, 1628 (2009). Uses of copyrighted works are then bucketed into two categories: expressive or nonexpressive. Expressive uses “relate to, and are motivated by, the expression embedded within a copyrighted work.”223Id. at 1624. Think reading a book, distributing a movie, or looking at a painting. Nonexpressive uses “do not communicate the author's original expression to the public.”224Id. at 1625. Generally, because only expressive uses have the potential to act as a substitute for expressive communication of a work, only they should constitute infringement.225Id. at 1628. Conversely, nonexpressive uses of a work should not be infringing.226Id. at 1626. Sag uses this framework to explain copyright decisions concerning “copy-reliant technologies,” systems that copy works “routinely, automatically, and indiscriminately” for nonexpressive uses.227Id. at 1608. The technologies he explores that make nonexpressive uses of copyrighted works include search engines228Sag, supra note 222, at 1618. and plagiarism detection software.229Id. at 1623. Ultimately, Sag advocates fair use as the appropriate doctrinal home for recognition that nonexpressive uses should not be infringement, because the purpose of the use is such that the use does not “merely supersede the objects of the original creation” and thus does not pose a danger of substitution in a cognizable copyright market.230Id. at 1645, 1647 (citing Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 583 (1994)), 1656.
Applying Sag’s theory of nonexpressive use by copy-reliant technologies to Copilot is appealing, because Copilot systematically copies a vast number of copyrighted works. But relying on that theory to predict that Copilot will avoid liability is unsatisfactory for two reasons. First, the theory predates the explosion of machine learning and AI.231See Lemley & Casey, supra note 75, at n. 9. That copy-reliant technologies make nonexpressive use of copyrighted works cannot be assumed, because systems like Copilot exist to generate new works, and arguably use training examples for their expressive content. Second, Sag admits that software should be treated exceptionally within his framework.232Sag, supra note 222, at 1637. A typical use of software is not expressive but functional.233Id. Thus, for software, the axiom that copyright’s touchstone is communicating expression to the public is called into question. Software’s status as a “non-native species” thus precludes its inclusion in his theory for works “indigenous to copyright.”234Id. at 1638.
Lemley and Casey argue for a doctrine of “fair learning” that would generally consider ML or AI training a fair use.235Lemley & Casey, supra note 75, at 776. More specifically, such use should be presumptively fair if the purpose of the use by the system is to access and learn from the unprotectable elements of a work, rather than to “obtain or incorporate the copyrightable elements.”236Id. This non-expressive use is like the copying necessary to access unprotectable functional aspects of code in reverse engineering cases. See id. at 761–62. This recalls the Sega reasoning discussed above, where copying as a necessary step to accessing the unprotectable interoperability code of video games was held a fair use.237Supra, § III.A.1. In many ways this aligns with the concept of nonexpressive use laid out by Sag: there is a poor “fit between what the law protects and what the [ML system] wants,” because the ML system does not copy works for a “copyright-related reason.”238Lemley & Casey, supra note 75, at 750, 772. They offer the example of a self-driving car trained on photographs caring primarily about the unprotectable fact that a photograph depicts a stop sign, and not caring about the photograph’s protectable elements like lighting and composition.239Id. at 772. However, ML systems whose training may use the expressive, protectable elements of the training examples, and whose output is an expressive work of the same kind as the training data, are said to present a harder fair use argument.240Id. at 777. Copilot is such a system, and thus again eludes clear inclusion in a coherent theory.
Sobel finds fair use ill-prepared to handle “expressive machine learning” which can access and use the protectable, expressive elements of its training data and then output entire expressive works.241See Sobel supra, note 92, at 57. Non-expressive use relies on the absence of both of those capabilities.242Id. He points out that these features of newer ML applications present a thorny problem for fair use to continue its role benefiting the public.243Id. at 79. But the systems Sobel discusses can be distinguished from Copilot, because those systems output entire works. While Copilot can arguably make use of the expression in training data and output expression itself, it does not currently present a threat to programmers as a group. It augments programmers by helping them perform their work. This notion tempers the factor four concerns of expressive ML serving in the place of human authors. ML still presents a sticky case for fair use, but perhaps Copilot is less problematic for that reason.
Copilot is not the first generative AI system and will not be the last. Machine learning systems present yet another “significant change in technology” that copyright may be asked to confront.244Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 430 (1984). ML systems that both make use of and output expression present perplexing questions that doctrines like fair use may be ill-equipped to answer. Systems that output entire expressive works might present an easier case for courts to address, but where future authors use a system like Copilot in combination with their own expression to create a work, the case becomes harder. As ML becomes more useful, it may end up following the “don’t ask for permission, ask for forgiveness” path trod by other copy-reliant technologies like search engines. After its value has been demonstrated, a court may be hesitant to effectively crush a nascent technology by saddling its builders with copyright liability and the future responsibility to pay for its numerous input data. Copilot’s usefulness is becoming clearer; GitHub claims that for some programming languages, thirty percent of code written is suggested by Copilot.245Bryan Walsh, GitHub Sees Uptick in Coders Using AI Assistant, Axios (Oct. 27, 2021), https://www.axios.com/copilot-artificial-intelligence-coding-github-9a202f40-9af7-4786-9dcb-b678683b360f.html (last visited Dec. 16, 2021). Perhaps the cat is already out of the bag.