Episode 15: How to protect training data used in the creation of AI systems

Lisa Desjardins: You’re listening to Canadian IP Voices, a podcast where we talk intellectual property with a range of professionals and stakeholders across Canada and abroad. Whether you are an entrepreneur, artist, inventor or just curious, you will learn about some of the real problems and get real solutions for how trademarks, patents, copyrights, industrial designs and trade secrets work in real life. I’m Lisa Desjardins and I’m your host. The views and opinions expressed in this podcast are those of the individual podcasters and do not necessarily reflect the official policy or position of the Canadian Intellectual Property Office.

The most traditional path to knowledge is studying existing information, drawing conclusions, maybe testing and then applying the knowledge in new situations. There’s very little question of who owns, for example, the skills or music of a guitar player or the money an investor has made through applying their new knowledge learned from a book or a course. But when we use existing information to train computers, or artificial intelligence systems, to, for example, create music or invest, there actually are some unanswered questions as to who owns what. Here, we need to understand how AI systems are built and to what extent IP rights can protect these different building blocks.

To help us understand the current issues in this space, I am pleased to welcome Paul Gagnon and Misha Benjamin to our podcast. Both Paul and Misha have studied law, worked at law firms and are now engaged as in-house legal counsels, helping firms make decisions about how to protect data and artificial intelligence. They have given presentations on artificial intelligence and data markets, spoken in front of the Parliament to request better clarity on copyright protection and how the data is used by online intermediaries. Paul and Misha, welcome to this podcast.

Misha Benjamin: Thank you. Great to be here.

Paul Gagnon: Great to be here. Thanks.

Lisa: Misha, I’ll start with you. Can you talk a little bit about yourself and the kind of work you’re involved in?

Misha: Absolutely. I think by the time this airs, I will have switched positions, but I will be general counsel at a company called Sama. What Sama does is they offer labelling services for data as well as data preparation and data engineering services and solutions. Prior to that, I was at McKinsey, where I supported contracting and our policies related to large enterprise AI deployments in particular. Before that, I was at Element AI, where I was involved in all aspects of commercialization and production of artificial intelligence software.

Lisa: Interesting. Paul, I’d like to turn it over to you now. You’ve been named one of the world’s 300 leading IP strategists. Can you tell us about your work?

Paul: My work today consists of a number of different mandates as assistant general counsel for Taiga Motors. Taiga is a company that builds electric power sports vehicles, like snowmobiles and personal watercraft. Prior to that, I worked for another tech company, but before that, I was at Element AI with Misha. I’ve also had experience working at Cirque du Soleil as the tech and digital attorney there as well as Intel and with a law firm. So, my background is really intellectual property but also commercial and more general corporate matters as well.

Lisa: Today, we’re going to talk about the difficulties in AI systems and the different components in an AI system. I was wondering, Misha, could you take us through the process of creating an AI system, like the ones being deployed in enterprise today? What are the steps in creating this?

Misha: Yes, absolutely. It’s definitely a multi-step process with a lot of different things going into it. I think the key thing to understand is that it’s a combination of software, which is generally an AI model plus a lot of infrastructure on it, as well as data. I’ll get into how those two points interact a bit more by giving a pretty typical example of what you could think of for an enterprise AI system. Let’s say you wanted to create an AI system that automatically inputs claims for car insurance. Let’s say you get into a car accident or a fender bender, nothing serious. You want to quickly have your insurance manage the claim, so you take a picture, you enter some information and you submit it to your insurer.

Then that will go through different steps within the insurer’s organization in order to decide on whether the claim should be paid and how much should be paid for. So it seems relatively simple when you think about it at a high level. But in order to deploy a system like that, there’s a bunch of different steps that have to happen. The first is, generally, the insurer will collect data that it already has, so information about past claims and whether they were paid or not and what were the motivating factors. That will often be supplemented by other data sources.

You can think about, let’s say images that come from vehicle recall websites or databases. Often synthetic data might be created as well if the existing data can’t be used and there’s no current existing data for that problem. That’s step one, collecting all the data.

Number two is making sure that that data actually says something. The way that that’s done is you need to assign value to that data and that’s generally through some sort of labelling process. For instance, in images of a fender bender or a car accident, you could show points in different areas that are not the normal car’s envelope, and an AI would be able to recognize that there is an accident or something occurred here so that it’s not in [inaudible 00:06:11]. That’s the labelling process.

Then there’s the training process. This is where the data meets the software. The software is basically taught by the data what to look for. It’s done by associating the data to the labels as well. This is one of the key steps. This is where a lot of computing power goes in, a lot of expertise goes in. And ultimately, what results from that is that the AI software, the model, is modified by the exposure to the data so that it generates an output that actually reflects reality.

Then, obviously, there’s an output and the model will be put into production. Then there’s an ongoing monitoring to make sure that, one, the software is working properly and it’s doing what you expect it to do. Two, that, as the input data changes, so for instance, let’s say different types of cars come into the market or different accidents happen or the coverage changes, that’s reflected in the data and then reflected in the way the model operates as well. Then the outputs are monitored and can be put into production in different ways.

There’s a whole bunch of organizational change management that will happen with this as well to make sure that it’s done safely and securely and that nobody’s being discriminated against. Nobody’s getting huge payouts when they shouldn’t. The other thing is, generally, one system is put in place and if that data is well cleaned, well labelled and the AI system is doing what it’s supposed to, generally, the companies will then try to move to a higher level of automation or a more value-driven use case for that model and that data. For instance, they’ll go from just routing the claim within the organization to creating a model that can give an expected payout for a given claim. So you’ll go up in value across those use cases.
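Editor’s note: the steps Misha walks through here (collect, label, train, monitor) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the insurer’s actual system: the damage scores, the labels, the 0.9 accuracy bar and the function names `train_threshold` and `accuracy` are all hypothetical, and the "model" is a single cut-off value rather than a real image model.

```python
# Toy sketch of the collect -> label -> train -> monitor loop.
# All data, names and thresholds here are hypothetical illustrations.

# Steps 1 and 2: collected data, each item paired with a human-assigned label.
# Each record is (damage_score, label), where label 1 means "pay the claim".
labelled_claims = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]


def train_threshold(examples):
    """Step 3: pick the cut-off that agrees with the most labels."""
    candidates = sorted(score for score, _ in examples)
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((score >= threshold) == bool(label) for score, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, examples):
    """Step 4: monitoring compares model outputs with freshly labelled data."""
    hits = sum((score >= threshold) == bool(label) for score, label in examples)
    return hits / len(examples)


threshold = train_threshold(labelled_claims)

# New claims arrive after deployment; a drop in accuracy signals retraining.
new_claims = [(0.3, 0), (0.5, 1), (0.7, 1)]
needs_retraining = accuracy(threshold, new_claims) < 0.9
```

In this sketch the learned cut-off is the only thing that changes during training, which is the toy counterpart of the model being "modified by the exposure to the data."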

Lisa: There’s a lot of use of data. You go through and you label it. There is a lot of use of data that goes into a training model. I hear a lot of copying of data. Some of it might be incidental. I was going to ask you, Paul, you’ve been advocating for clarity on how data can be used. In 2019, you spoke in front of the Canadian Parliament, when the discussion about data use started in Canada. What are the issues here?

Paul: To Misha’s point, we’re really at the beginning of the value chain around production and use of AI models. A lot of the data issues arise there. Obviously, there are data issues around other aspects of it too, but really, if you home in and look at the different steps that Misha mentioned, all of these have potential copyright issues arising from them. I say potential because, again, we think of data as a monolith, but it’s really not. Data is how a computer system or an AI system looks at an input, but ultimately, and quite fundamentally, data can be just about anything.

Misha mentioned pictures of cars and whatnot, but the data that these systems in the insurance company example might be using could be also weather data or different maps and different other pieces of information that are used and combined and considered by these computer systems. That data is not a monolith. There could be various ranges of protection, or not, of that data under copyright.

If you look at a database of photographs, there’s the copyright in those works that does exist. If you look at more informational, less artistic works, they might not meet the threshold required for copyright. If you move along the value chain there, there’s other potential copyright implications. I know Misha is quite keen on talking about labels these days and copyrights, so Misha, go ahead.

Misha: I think that’s a good example. I think it’s important to show that, at each step, there’s different actors that might have different copyrights in different elements. If you were, for instance, training an AI model to generate works of art and you wanted it to learn specific elements from different works of art throughout history and you had an art history major look at those different works of art and bring the computer’s attention to certain elements by labelling certain aspects, then there’s definitely an element of skill and selection to those labels and those labels are protectable by copyright as well. The person doing that could claim copyright in those labels.

Paul: Right. That’s the threshold in Canada. It’s the exercise of skill and judgement that generates the ability to claim copyright on the work. So I think Misha’s point is that not all labelling is automated and skill-less. Quite the reverse, because the computer doesn’t really ascertain or assign any meaning to things. You could label images cat or dog, but an AI program doesn’t instinctively know the difference conceptually between those two things. It only knows to associate labels with images and then generalize that input to situations that it hasn’t seen before. That’s why the labels are so important.

Misha: Then another interesting area where the copyright question by law hasn’t been explored is when you train an AI model that manifests itself as [unintelligible 00:11:59] parameters in a model. That’s code that’s expressed, that’s being changed, based on images or data that the model is being exposed to.

When you have one person that owns the model, one person that owns the data and it could be a third person that’s training the model on that data, you’re creating a certain expression of code based on that. Normally, that’s dealt with by copyright of who owns that modified code. But the fact is, there’s copyright there that’s very valuable that needs to be expressly dealt with or else there is a grey zone in certain instances.

Paul: To come back to your question about what issues exist in current copyright law. The issue is that the Copyright Act is relatively prescriptive with a bunch of different uses of works, because anything protected under copyright is called a work under the Copyright Act. The Copyright Act, I think, intuitively covers a lot of things really well, like books, printed works, the right to take them to another format. The Act itself looks at a bunch of different scenarios and works and then prescribes, okay, what actions can the copyright holder take and authorize.

The thing is, for data, the use that’s being made by AI systems doesn’t fit neatly into any of those definitions. Quite fundamentally, when an AI system is "using" a work to be trained upon, it’s not really using the work intrinsically. It’s not using it because it wants to play a song or translate a work. Really what it’s doing is using this corpus to learn, much like we would read a book and carry that knowledge with us moving forward.

The issue is that, with computer systems, the training process itself creates a ton of different copies. It is the different steps that Misha mentioned. You can see that there’s different local files and folders that comprise this. There’s a bunch of different copies that can be made along the way. So for folks that are training AI systems, if you take a very formalistic look at the Copyright Act, and if you look at a plain reading of the Copyright Act, then all of this training falls into a grey zone where it’s not really clear, from a legal standpoint, what leg people are standing on.

When I spoke in a parliamentary commission on copyright reform on this topic, the goal was really to shed light on that issue around data usability and to show that the fair dealing exemptions we have, which could give some clarity, wouldn’t really cover a number of the use cases that we see for AI development. It’s really taking the Copyright Act and looking at training of AI systems and trying to figure out if there’s a gap. And we thought that there was one that’s warranted. There’s many ways to encourage an AI ecosystem, and a healthy, positive legal framework that gives them certainty is definitely one of them. That was the crux of what we put forward at that time.

Lisa: Yes. And of course, the protection is going to be a key issue in here. Because the training data, because it is in a grey zone, is actually what makes the AI system worth anything. It needs it to learn. Without that data, it’s going to become useless. But you’ve worked on some solutions here together with some of Canada’s leaders in the field of AI. You’ve co-authored guidelines to help someone who’s making data available so that they can decide what can and what cannot be done with it. Can you talk a little bit about the guideline that you wrote and what you wanted to do with it?

Misha: Absolutely. What we noticed is that a lot of people are making really great data available in a variety of formats. Universities are doing it a lot. A lot of people are spending their free time working on fantastic training sets, for AI in particular. And a lot of people are just uploading things in other contexts that happen to be useful for AI. Generally, what we’ve seen is that people are using open source software licence terms to protect or limit what people can or can’t do with that data. We’re all relatively familiar with the open source software licences. They’re fantastic, they’re actually very widely used for AI models as well, which is why the industry is so familiar with them.

I think we could go deep into that, but I think there’s going to be another episode on that later on in your podcast. But the fact is that data is information, and it’s being used as information not as a copyrightable work, and so the current open source software framework really doesn’t apply to the uploading of that data. Maybe Paul could go into some of the problems of what happens when we try and apply that framework to this data.

Paul: Yes, exactly. Basically, when we say open source, we mean software that’s made freely available, and by freely it means that it’s made available with the ability for folks to use it in any context and in any way that they wish. The thing is, those licences really apply to computer works and not to works when used as data. The gap there is that these licences are binding contracts, but they don’t really apply to the subject matter. So because of the intricacies of how data is used and, more importantly, because of the AI value chain, we found that most if not all of the licences that were used quite commonly were inappropriate because they didn’t really give that granularity for folks to really consider the AI value chain.

A lot of these licences basically say, you can use the work without really expanding that notion of use. That’s really not exactly giving the right level of control and granularity over what can actually be done because there are different interests in the copyrighted data. You can modulate those permissions, or you should be able to modulate those permissions and give a little bit more granularity about what can and can’t be done. And I think that’s the spirit of it.

Misha: A few quick examples to show, first of all, what the problem is with the current language. Let’s say I uploaded a bunch of images of fake people’s faces, what’s called synthetic data, to use in order to count the number of people in a crowd. I might be very comfortable with that use case of counting the number of people in a crowd here in Canada, but I could be worried, let’s say, that it would be used for military applications. I might say that I’m uploading this data but, in order to protect it, I’m going to use a very common software licence, called Creative Commons ShareAlike, to make sure that anything that’s using this data is shared back to the community.

So if I was uploading software, that would work very well because any time that anyone modified that software, they would have to share that modified software back with the community, and I would be able to monitor what’s going on and also reap the rewards of being in an open source community. The problem is that, because I uploaded data and not software, under the more commonly used software licences anyone could use that data and they wouldn’t have to share back the model that was trained on it.

We see that there’s a gap between the intention of the person that uploaded that data set, where that person wanted everything to be shared back to the community, and the actual legal effect, where this is not a derivative work under the terms of that licence, so there’s no sharing obligation.

Paul: Another good example is health data. Let’s say, a hospital or a research arm of a university that has interesting and valuable health data. Well, obviously, between the privacy of that data, the economic value of that data and my mandate as an organization to protect it, I may or may not want folks to do different things with it or train different models with it. But it doesn’t mean that I don’t want to put that data to a good use.

So if I’m sharing it, and instead of saying you can use it, I’m breaking down the different rights in the AI value chain and granting permissions based on that, that gives me more control and more certainty around what people can do rather than just say you can use it this way or use software licences that don’t really work.

With that in mind, the paper that we wrote, those guidelines, made their way to what we call the Montreal Data Licence, which was basically mapping all of the different use cases and rights that you can grant or not with respect to data. That was really the goal, not to say what people should or shouldn’t do but create a framework in which there were additional terms and additional concepts that they could work with that were better suited to data and AI.

Misha: Yes, exactly. To bring it back to, let’s say, the example once again, I would be able to say under this data licence framework, "You can use this data set but any labels that you create on the data set have to be reshared with the community" or "You can use this data set but the labels and the trained model have to be reshared with the community," for instance. Or "You can use this data set once again for commercial use, but you cannot distribute a model trained on this data set without seeking my permission," for instance.

That way, you can prescribe what people can do. They can experiment with it, but if they wanted to commercialize it or put it into production, they would have to come back and see you to make sure that either maybe you wanted to benefit commercially so that you could sell them a licence or you could have ethical concerns about the use of the data that you uploaded. So this would allow you to have oversight into how that eventual model is used because it was trained on your data. It’s really linking the actual use that people make of data to the economic or oversight rights that people would like to have when they make data available.
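Editor’s note: one way to picture the per-step permissions Misha describes is as a small record with one flag per right in the AI value chain. The sketch below is a loose illustration only. The class `DataLicenceTerms`, its field names and the `obligations` helper are hypothetical and are not taken from the Montreal Data Licence text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLicenceTerms:
    """Hypothetical per-step permissions; not the Montreal Data Licence wording."""
    may_label: bool = True
    may_train: bool = True
    may_use_commercially: bool = False
    may_distribute_trained_model: bool = False
    must_share_labels: bool = False
    must_share_trained_model: bool = False


def obligations(terms, actions):
    """Return (blocked actions, share-back duties) for a set of planned actions."""
    permission_for = {
        "label": terms.may_label,
        "train": terms.may_train,
        "commercial_use": terms.may_use_commercially,
        "distribute_model": terms.may_distribute_trained_model,
    }
    blocked = sorted(action for action in actions if not permission_for[action])
    duties = []
    if "label" in actions and terms.must_share_labels:
        duties.append("share labels with the community")
    if "train" in actions and terms.must_share_trained_model:
        duties.append("share trained model with the community")
    return blocked, duties


# Third example from the discussion: commercial use is allowed, but
# distributing a trained model requires going back to the data owner.
terms = DataLicenceTerms(may_use_commercially=True, may_distribute_trained_model=False)
blocked, duties = obligations(terms, {"train", "commercial_use", "distribute_model"})
```

Here `blocked` lists the planned actions that would need the data owner’s permission first, which mirrors the idea of coming back to the uploader before commercializing or distributing a model.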

Lisa: We’re talking a lot about the use of data and the people that are providing the data to be used and the Montreal licence that you wrote. If we now shift the focus a little bit and we think about entrepreneurs that may or may not be aware of these limitations, I imagine you’ve met some of them, entrepreneurs and inventors that are struggling with how they can use data. What do you think are the common mistakes entrepreneurs that are relying on data make and how can they avoid these mistakes?

Paul: I think the first one is really, really one of knowledge. That’s the first issue. If someone started a business and said look I want to sell a book based on my 20 favourite poems from authors that I love, they would intuitively know that they don’t necessarily have the right to commercialize that compilation, that they would have to go seek permission from the different authors.

That kind of knowledge, and I guess culture, around copyright exists for a bunch of different works, but it definitely does not exist around data and around understanding what happens when you use data that you don’t necessarily have the rights for in the field of AI. I think the first thing we’ve seen is definitely just a lack of awareness that this exists and can materially impact your business. I think that’s the first starting point.

Misha: Exactly. Before people check whether or not the terms work for what they’re trying to do, they have to know to check for the terms, right? Then I think people take open source software language at its face value when they’re looking at data as well. Often people will take data that was intended for a certain purpose. For instance, a lot of competition data is misused. I think a lot of people will crawl or do bulk downloads of data in ways that aren’t necessarily allowed or even really ethical depending on what you’re looking at and what you’re crawling.

Paul: There’s different other rights at play too. A lot of what we discussed today is IP, is copyright, but with the data sets that you’re working with, there’s other considerations. I think the source of it is one topic we just talked about, but also the privacy interests. They could be at play in different pictures or data points that you have. There’s obviously how that data is collected itself, its quality and whether it is a representative sample or is discriminatory. And what you feed into an AI system is extremely important. That’s another legal issue without being one of IP per se.

Misha: Yes, absolutely. Sometimes people are just using data sets that are no longer relevant. I think a lot of data has changed very markedly over the past few years with the pandemic. And if you’re not finding ways to get up-to-date data or adjust for new data types or sources that have more recent value, then that previously excellent data that you had may not be as useful.

Another thing that we’re seeing is a lot of people selling data, or a lot of data brokers. We’re getting much more knowledgeable now about the terms that should be included in data agreements, but I think both on the entrepreneur side and on the side of the SME making data available, there’s a lot of people who don’t understand, or don’t properly discuss, two things. One is the consequence of the fact that, once an AI solution has had access to data, even if it doesn’t access that data anymore, there is still value there.

For instance, with facial recognition software that was trained on people’s faces, you can delete that data. But at the end of the day, if you’re not retraining the model or deleting the weights associated with that data, then the damage is already done, or the use has already been made of that data. The other thing is talking about the trained version of the model as a separate item of copyright or separate work. The ownership and the different terms should be specifically and very granularly discussed among the parties.
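Editor’s note: the point that deleting data does not undo training can be seen even in a one-parameter model. The sketch below is a toy under stated assumptions, not facial recognition: the `train` function, the learning rate and the three data points are hypothetical, and the model is simply y = w * x fitted by gradient descent.

```python
def train(pairs, steps=200, learning_rate=0.05):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(steps):
        gradient = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= learning_rate * gradient
    return w


# Hypothetical training set in which y is always twice x.
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train(training_data)

# Deleting the data set does not touch the learned parameter:
# the model still carries what it learned from that data.
del training_data
prediction = weight * 4.0
```

After `del training_data`, the weight still encodes the relationship learned from the deleted examples, so removing that influence would mean retraining without the data or discarding the weight itself.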

Paul: With that in mind, I think as AI makes its way into smaller businesses and into more aspects of our daily life, if you’re responsible for these systems, you have to realize that the data on which they’re trained changes those models, their accuracy. And how they more or less portray the problem they’re trying to solve changes. Your responsibility, if you have the hands on the wheel of these systems so to speak, is to make sure that that’s always a good outcome that you’re creating.

That means sometimes retraining the models on new data and making sure that that’s still a valid and accurate model. But the thing is, if that data disappears or if the licence terms have changed, there’s a number of considerations there. You can’t always assume that you’ll have consistent and constant access to the data you’re using. But on the reverse, it also means that, as businesses, the data you do produce and the insights that you have as a business are actually potentially quite valuable, and there are ways to craft and develop new kinds of partnerships around that in pretty creative ways. Ultimately, better understanding your data, but also how, when and why you’d license it or give access to it to others, including for AI, is a great tool. There’s really interesting outcomes to be had with that mindset.

Lisa: If someone’s working on AI systems and is using a lot of data, do you have a place they can go to learn more about AI and data?

Paul: The paper that we reference. It’s called "Towards Standardization of Data Licences." A simple search on the internet and you’ll find it. But if not, you can look up Montreal Data Licence. I think that’s a good starting point to highlight all of the issues that we mentioned and a good explainer of the AI value chain and those different steps along the way. Again, the name of the paper is "Towards Standardization of Data Licences," but if you look for Montreal Data Licence, you should be able to find it.

Lisa: Well, Paul and Misha, it’s been very technical but very interesting. Thank you so much for explaining what some of these issues are and actually for sharing some of the solutions as well for some people that might be looking at using data in their AI system. Thank you so much.

Misha: Thanks for having us.

Paul: Thank you.

Lisa: You’ve listened to Canadian IP Voices, where we talk intellectual property. In this episode, we’ve taken a deep dive into AI systems to understand how the authors and users of the training data that goes into AI systems may or may not be protected by copyright and the kinds of licences that authors can use to better control how their data is used in AI systems and machine learning.

If you’re interested to learn more, look up the description of this episode, where we have a link to the paper "Towards Standardization of Data Licences, the Montreal Data Licence," a paper published in 2019 that both Paul Gagnon and Misha Benjamin co-authored together with expert peers and scientists working at the leading edge of artificial intelligence.