“MZ: In order to train neural networks in a reasonable time, we need something like a GPU, and the GPU requires a non-free driver and non-free firmware, so it will be a problem if the Debian community wants to reproduce neural networks in our own infrastructure. If we cannot do that, then any deep learning application integrated in Debian itself is not self-contained. This piece of software cannot be reproduced by Debian itself. This is a real problem.”
[00:00:47] SF: Welcome to Deep Dive AI, a podcast from the Open Source Initiative. We’ll be exploring how artificial intelligence impacts free and open-source software, from developers to businesses, to the rest of us.
[00:01:01] SF: Deep Dive AI is supported by our sponsor, GitHub. Open-source AI frameworks and models will drive transformational impact into the next era of software, evolving every industry, democratizing knowledge, and lowering barriers to becoming a developer. As this revolution continues, GitHub is excited to engage in and support OSI’s deep dive into AI and open source, and welcomes everyone to contribute to the conversation.
ANNOUNCER: No sponsor had any right or opportunity to approve or disapprove the content of this podcast.
[00:01:33] SF: This is an episode with Mo Zhou, a first-year PhD student at Johns Hopkins University and an official Debian developer since 2018. He recently proposed the Machine Learning Policy for Debian. He’s interested in deep learning and computer vision, among other things.
[00:01:53] MZ: Hello, everyone.
[00:01:54] SF: Thanks for taking the time to talk to us. I wanted to talk to you in the context of Deep Dive AI. I would like to understand a little bit better what the introduction of artificial intelligence means for free software and open source. What are the limitations? What new things has it been introducing? What got you interested in volunteering and thinking about machine learning in the Debian community?
[00:02:21] MZ: Well, actually, artificial intelligence is a long existing research topic. I think we can split your questions into small ones, so I can handle this.
[00:02:34] SF: Absolutely.
[00:02:36] MZ: Where should we start? Let’s start from a brief introduction of what artificial intelligence is. In the last century, there was already some research about artificial intelligence. You may have heard some old news that computers can play chess with human players and human players are beaten by computers. That’s a very classical example of artificial intelligence. Back then, artificial intelligence involved many manually crafted things. If you design a computer program that can play chess with you, there is basically a searching algorithm that searches for a good play for the next step based on the current situation on the chessboard. There are many manually crafted things.
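To make the search idea concrete, here is a minimal sketch in Python. It is not a chess engine: it assumes a much smaller, hypothetical game (a pile of stones, each player takes 1 to 3, whoever takes the last stone wins) so that the whole game tree can be searched. The names `MOVES`, `score`, and `best_move` are invented for this illustration.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # hypothetical rule: a player may take 1, 2, or 3 stones


@lru_cache(maxsize=None)
def score(stones: int) -> int:
    """Return +1 if the player to move can force a win, else -1."""
    if stones == 0:
        # The previous player took the last stone, so the player to move lost.
        return -1
    # Try every legal move; the opponent's best result is our worst.
    return max(-score(stones - m) for m in MOVES if m <= stones)


def best_move(stones: int) -> int:
    """Search all legal moves and pick the one with the best outcome."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -score(stones - m))


print(best_move(7))  # number of stones to take from a pile of 7
```

Everything the program "knows" is written by hand in the rules and the search. Nothing is learned from data, which is the contrast with the neural network approach discussed later in the conversation.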
Recently, some factors have brought changes to the artificial intelligence research community. The two most important factors are big data and the increase in hardware capacity. There is a lot of hardware that is capable of parallel computing, like GPUs and FPGAs. This hardware is very important; without it, the recent advancements in deep learning would be impossible.
[00:04:05] SF: Right. So basically, you’re saying that the old chess playing games, they had a database of possible moves, and what they were doing, they were searching quickly between possible alternatives and evaluating the best option?
[00:04:19] MZ: Yeah. That is the classical algorithm. Nowadays, if you look at AlphaGo, that’s very different from the past algorithms.
[00:04:29] SF: Right. AlphaGo is the automatic player for Go.
[00:04:34] MZ: Yeah.
[00:04:35] SF: Which is a lot more complex than chess, from my memory.
[00:04:39] MZ: Yeah. Basically, recent algorithms can handle very, very complicated situations. I can give you a very simple example. Imagine that you’re a programmer. Now I present you two images, one with a cat and one with a dog. How do you write a program that can classify the two images and tell you which is the dog and which is the cat? So basically, recent artificial intelligence can handle such complicated scenarios, and is much more capable than what I have said.
[00:05:16] SF: Right. Okay, so how do they do that?
[00:05:18] MZ: The recent advancements are based on two factors, big data and computational capability. Let’s start from big data. If you want to do some classification of the cat and dog images, first, you have to prepare a training data set. For example, you take 100 photos of various kinds of cats, and another 100 photos of various kinds of dogs. Then you can label all the images you have collected. Then this is called a training data set.
[00:05:59] SF: The training data set is basically the raw pictures plus some metadata describing them that a human puts on.
[00:06:08] MZ: Exactly. Given such a data set, we then construct a neural network. This neural network is composed of many, many layers, such as convolutional layers, nonlinear activation layers, and fully connected layers. Almost all of these layers come with some learnable parameters, where the knowledge the neural network has learned is stored. Given such a neural network, you input an image into it and it will give you a prediction: it will predict whether the image has a cat or a dog. Of course, without training, it will make wrong predictions, and that’s why we have to design a loss function to measure the discrepancy between its real output and our expectation. Then, given such a loss function, we can do back propagation and stochastic gradient descent. After that process, the neural network will gradually learn how to tell which image is a cat and which image is a dog.
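The training loop described here can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions: each "image" is reduced to two invented feature values, and the "network" is a single layer, so back propagation collapses to one chain-rule step. The names `TRAIN`, `predict`, `loss`, and `train` are made up for the example.

```python
import math
import random

# Toy stand-in for labeled images: (features, label), label 0 = cat, 1 = dog.
# The two features and their values are invented for illustration.
TRAIN = [((0.9, 0.2), 0), ((0.8, 0.1), 0), ((0.7, 0.3), 0),
         ((0.2, 0.9), 1), ((0.1, 0.8), 1), ((0.3, 0.7), 1)]


def predict(w, b, x):
    """Probability that x is a dog: a sigmoid over a weighted sum."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, data):
    """Mean cross-entropy between predictions and the expected labels."""
    total = 0.0
    for x, y in data:
        p = predict(w, b, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def train(data, lr=0.5, epochs=200, seed=0):
    """Stochastic gradient descent, updating after each single example."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0  # learnable parameters start with no knowledge
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # one shuffled pass
            err = predict(w, b, x) - y  # gradient of the loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


w, b = train(TRAIN)
print(loss([0.0, 0.0], 0.0, TRAIN), "->", loss(w, b, TRAIN))
```

The learned knowledge ends up entirely in `w` and `b`, which is why the conversation later treats the trained parameters as something that must be distributed, licensed, and reproduced alongside the code.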
[00:07:23] SF: Okay, so take software that runs on your phone and tells you whether you’re snapping a picture of a dog or a cat. In the past, if we were talking about non-AI systems, you snapped a picture and stored it on your computer. The software was not involved in doing anything but storing and retrieving it from the file system. Now, if you add a search engine inside that application that detects your pet in your collection of pictures, we’re adding a little bit of complexity. If we wanted to distribute that piece of software, with the neural network that has been trained to detect cats and dogs, inside Debian, or inside one of the few free software, open-source mobile systems, to help retrieve our pictures, what do we need?
[00:08:20] MZ: Actually, we need lots of things, especially if we are doing distribution of free software. If we create an artificial intelligence application, we will need data. We’ll need the code for training the neural network. We will need the inference code for actually running the neural network on your device. Without any one of them, the application is not complete. None of them can be missing.
[00:08:52] SF: The definition that we have right now of what is complete and corresponding source code, how can it be applied to an AI system, to an application like this that detects pictures of dogs?
[00:09:04] MZ: Well, actually, the neural network is a very simple structure, if we don’t care about its internals. You can just think of it as matrix multiplication. Your input is an image, and we just do lots of matrix multiplications, and it will give you an output vector. This is simply what happens in the software. Both the training code and the inference code are doing a similar thing.
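As a rough sketch of the "lots of matrix multiplications" view, the following Python applies two weight matrices to an input vector and returns an output vector. The weights in `LAYERS`, the four-value "image", and the choice of ReLU as the activation between layers are assumptions made for this example; real networks have far larger matrices and other layer types.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector):
    """Nonlinear activation: negative values become zero."""
    return [max(0.0, v) for v in vector]


def forward(layers, x):
    """Apply each weight matrix in turn, with ReLU between layers."""
    for i, weights in enumerate(layers):
        x = matvec(weights, x)
        if i < len(layers) - 1:  # no activation after the last layer
            x = relu(x)
    return x


# Hypothetical weights: 4 input "pixels" -> 3 hidden units -> 2 class scores.
LAYERS = [
    [[1.0, 0.0, -1.0, 0.0],
     [0.0, 1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5, 0.5]],
    [[1.0, -1.0, 0.0],
     [-1.0, 1.0, 0.0]],
]

scores = forward(LAYERS, [1.0, 0.0, 0.0, 0.0])
print(scores)  # one score per class
```

Inference only runs `forward`. Training runs the same forward pass and then adjusts the numbers inside `LAYERS`, which is why the two pieces of code look so similar.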
Apart from the code, the data is something that can change. For example, we can use the same training and inference code for different data sets. Say I release code for the cat and dog classification problem, but you take the code and say, “Oh, I’m more interested in classifying flowers.” Then you can collect new data sets about different kinds of flowers and use the same code to train the neural network and do the classification by yourself.
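The separation between code and data can be illustrated with a deliberately simple classifier. Note the swap: this sketch uses a nearest-centroid model rather than a neural network, purely to keep it short, and every feature value and label in `PETS` and `FLOWERS` is invented. The point is only that `train` and `classify` never change while the data does.

```python
def train(dataset):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def classify(model, features):
    """Return the label whose centroid is closest to the features."""
    def dist(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: dist(model[label]))


PETS = [((0.9, 0.1), "cat"), ((0.8, 0.2), "cat"),
        ((0.1, 0.9), "dog"), ((0.2, 0.8), "dog")]
FLOWERS = [((5.0, 1.5), "iris"), ((5.2, 1.4), "iris"),
           ((1.0, 4.0), "tulip"), ((1.2, 4.2), "tulip")]

pet_model = train(PETS)        # same code...
flower_model = train(FLOWERS)  # ...different data, different model

print(classify(pet_model, (0.85, 0.15)))
print(classify(flower_model, (1.1, 4.1)))
```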
If you want to provide a neural network that performs consistently everywhere, you also have to release the pre-trained neural network. If you are releasing free software, that also requires you to release the training data, because free software requires freedoms that allow you to study, to modify, or to reproduce the work. Without the training data, it is not possible to reproduce the neural network that you have downloaded. That’s a very big issue.
Nowadays, in the research community, people are basically using neural networks that are trained on non-free data sets. All of the existing models are somewhat problematic in terms of license.
[00:11:10] SF: Why is this happening? Do you know? Do you have any sense?
[00:11:12] MZ: Yeah, the reason behind this is very simple. To train a functional neural network, you have to collect a lot of data. For example, say you want to make a face recognition application. Then you have to collect face data. Then who can collect such a large-scale data set? Only big companies can do this. It is very, very difficult for any one person to do this.
[00:11:43] SF: It’s definitely not something an amateur can do in their spare time in their bedroom.
[00:11:51] MZ: Yeah. Nobody can build a large-scale data set alone. For example, nowadays, the most popular data set in the artificial intelligence field is called ImageNet. It contains more than 1 million images with 1,000 classes. If you want to do a free software alternative, you need lots of people to do the labeling work and the image collection.
[00:12:18] SF: Of course, because this ImageNet dataset, I’m assuming, is not available under a free, open-source, or open data license.
[00:12:28] MZ: Yeah. It is not free. It is basically for academic purposes only. There are lots of pre-trained models across the Internet. Basically, everyone can download them and use them. There are potential license problems behind this.
[00:12:48] SF: Because you’re saying that this database has images and labels, it’s a time consuming process to apply them and classify images this way.
[00:12:59] MZ: Yeah, it is very time consuming and costs lots of money.
[00:13:03] SF: Of course. How about text-based data and other types of data that are not images?
[00:13:10] MZ: Well, you mentioned text. That’s another interesting topic, because recent advances in artificial intelligence have brought significant change to several research areas. The first research area is computer vision. It is about what we have just discussed, like classifying cat and dog images. Another field is computational linguistics, or natural language processing. It has lots of applications, such as machine translation. For example, Google Translate is based on neural networks. Now, text-based data is relatively easier to collect, because, you know, we can simply download the whole Wikipedia dump as training data. Its license is free.
[00:14:02] SF: Right. You still need to classify, you still need to do other passes, or?
[00:14:08] MZ: Well, it depends on what kind of task you want to deal with. For example, if you want to do machine translation, then you can simply download, for example, the English version of Wikipedia and the Chinese version of Wikipedia. Then, as long as you can find the correspondence between English and Chinese sentences, you have already got a usable machine translation training set.
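Once sentence correspondences are known, assembling the training set is mechanical. The sketch below assumes the hard part is already done: `ALIGNMENT` is a hypothetical list of index pairs that some sentence aligner would have produced, and the example sentences are invented rather than taken from Wikipedia.

```python
def build_parallel_corpus(english, chinese, alignment):
    """Pair up sentences given (english_index, chinese_index) links."""
    return [(english[i], chinese[j]) for i, j in alignment]


# Invented example sentences; a real corpus would come from aligned articles.
EN = ["Debian is a free operating system.", "It is developed by volunteers."]
ZH = ["它由志愿者开发。", "Debian 是一个自由的操作系统。"]
ALIGNMENT = [(0, 1), (1, 0)]  # hypothetical output of a sentence aligner

pairs = build_parallel_corpus(EN, ZH, ALIGNMENT)
for source, target in pairs:
    print(source, "=>", target)
```

Each pair plays the same role as an image and its label in the cat and dog example: the source sentence is the input and the target sentence is the expected output.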
[00:14:39] SF: We have these datasets that are proprietary and hard to distribute. There are trained models being distributed that depend on this original dataset. Now, one of the rights of users of free and open-source software is that they can modify software to fix bugs. If we have a model that has some difficulties in distinguishing European faces from African faces, for example, in a face recognition algorithm, or some other issue with dogs and cats, do we, as recipients of the software, need to start from scratch to fix this bug?
[00:15:18] MZ: Yeah. Actually, the question you have mentioned is a very good question. For example, if we train a face recognition network and your training data set contains only a few, for example, Asian faces, then your network will expectedly perform badly on such Asian faces. This is a notorious and famous issue called data set bias. It is a cutting-edge research topic. People are working on this. I think this issue will be overcome sometime in the future. But this problem exists.
If we want to deal with such an issue nowadays, what we can do is collect more data. For example, your neural network behaves badly on cat data. Then you simply collect more cat data and train your neural network again. If you want to do this, you will find that to train the neural network again, you need the original training set, so you can put more images into it. You also need the training code to produce a new neural network.
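One way to see data set bias is to count training examples per group and to measure accuracy per group instead of overall. The sketch below does both in plain Python. All numbers in `TRAIN` and `RESULTS` are invented to illustrate the pattern and are not measurements from any real model.

```python
from collections import Counter, defaultdict


def label_counts(dataset):
    """Count how many training examples each group has."""
    return Counter(group for _, group in dataset)


def accuracy_by_group(results):
    """results: (group, was_correct) pairs from evaluating a model."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, ok in results:
        totals[group] += 1
        hits[group] += int(ok)
    return {g: hits[g] / totals[g] for g in totals}


# Invented numbers: group "B" is badly under-represented in training.
TRAIN = [("img", "A")] * 90 + [("img", "B")] * 10
RESULTS = ([("A", True)] * 19 + [("A", False)]
           + [("B", True)] * 3 + [("B", False)] * 2)

counts = label_counts(TRAIN)
acc = accuracy_by_group(RESULTS)
print(counts)  # how much data each group contributed
print(acc)     # how well the model does on each group
```

A check like this only diagnoses the problem. The remedy described above, adding more examples for the weak group and retraining, still needs the original training set and the training code.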
[00:16:40] SF: That’s a really important thing to understand. In order to modify an existing model, we need to have access not only to the original data set, but also to the software to train it. We need to know how that training has been configured. What do we need to tweak in there, for example, in the input parameters to the training? If we see that black dogs are constantly misinterpreted as cats, do we know how to retrain the system in order to give a better answer on that front, besides just giving it more data?
[00:17:21] MZ: In the research community practice, apart from the training data set, we also collect a validation data set. The validation set has basically the same setting as the training data set, but with new images and new labels. The two data sets do not overlap. If you do training on the training set, your neural network has not seen any data in the validation data set. After your training process, you can test your neural network on the validation data set. If the performance is good on both the training and validation data sets, then this neural network is good enough. After you have adjusted the neural network, you will also do the validation process to make sure the neural network you have obtained is sensible.
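A minimal sketch of that practice, assuming a toy data set where each "image" is just a number: shuffle once, cut the data into two non-overlapping parts, and score the model on both. The `model` function here is a hypothetical stand-in for a trained network, and the 80/20 split is an arbitrary choice for the example.

```python
import random


def split(dataset, validation_fraction=0.2, seed=0):
    """Shuffle, then cut into non-overlapping training and validation sets."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * validation_fraction)
    cut = len(items) - n_val
    return items[:cut], items[cut:]


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def model(x):
    """Hypothetical stand-in for a trained network."""
    return x >= 50


# Toy data: the "image" is a number, the label says whether it is >= 50.
DATA = [(n, n >= 50) for n in range(100)]
train_set, val_set = split(DATA)

print(len(train_set), len(val_set))
print(accuracy(model, train_set), accuracy(model, val_set))
```

If accuracy is high on `train_set` but much lower on `val_set`, the model has memorized its training data rather than learned something general, which is the failure the validation step is meant to catch.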
[00:18:18] SF: How predictable is the result of retraining? If I change the parameters, the input parameters of retraining the data set, do I know that I fixed the bug, or how will I know?
[00:18:31] MZ: Actually, this process requires some background knowledge and some experience. If you are an engineer in the related field, you will find it is very easy, because if you obtained a copy of code that is known to be working well, basically, you will not encounter any trouble, as long as you don’t change too much code inside, or significantly change the parameters, like the learning rate or something like that.
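Why a parameter like the learning rate matters can be shown on the simplest possible problem. The sketch below runs gradient descent on f(w) = w², a stand-in chosen for this example rather than a real network loss. With a small learning rate each step shrinks w toward the minimum at 0; with a rate that is too large each step overshoots and the value grows without bound.

```python
def gradient_descent(lr, steps=50, start=1.0):
    """Minimise f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w  # same update rule, only the learning rate differs
    return w


small = gradient_descent(lr=0.1)  # each step multiplies w by 0.8
large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2

print(abs(small))  # close to the minimum at 0
print(abs(large))  # far away from it and still growing
```

The code is identical in both runs, which matches the advice above: working code stays working as long as the known-good parameters are not changed significantly.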
[00:19:04] SF: Let’s assume that we have the original data set, we have all the elements to retrain the model. Now, let’s go on the hardware level. You said, we need storage for sure and we need fast storage. Then the computation side, what else do we need?
[00:19:20] MZ: If you search for deep learning frameworks on the Internet, you will find many, many solutions that work on various hardware platforms, like mobile phones, tablets, and personal computers. These frameworks are designed to be not specific to any hardware. What you can gain from some powerful hardware is speed. For example, if you run the same neural network on your personal computer with a strong GPU, it may run several hundred times faster than on your mobile phone. If you are a researcher in this field, you will quickly figure out that this speed issue is critical, because if you train a neural network on a CPU, it may require several years. If you’ve got a strong GPU, it only takes several hours. This is ridiculous.
[00:20:20] SF: Deep Dive AI is supported by our sponsor, DataStax. DataStax is the real-time data company. With DataStax, any enterprise can mobilize real-time data and quickly build the smart, highly scalable applications required to become a data-driven business and unlock the full potential of AI. With AstraDB and Astra Streaming, DataStax uniquely delivers the power of Apache Cassandra, the world’s most scalable database, with the advanced Apache Pulsar streaming technology in an open data stack available on any cloud. DataStax lives the open-source cycle of innovation every day, in an emerging AI-everywhere future. Learn more at datastax.com.
[00:21:00] ANNOUNCER: No sponsor had any right or opportunity to approve or disapprove the content of this podcast.
[00:21:05] SF: Speaking of Debian and going back into the free software concerns and open-source community concerns about training data sets, regarding the hardware, one of your papers from a few years back was mentioning the difficulty in getting access to accelerated CPUs, GPUs and some functions inside some of these processors that were not readily available inside Debian. Can you elaborate a little bit on that?
[00:21:32] MZ: Debian is an open source and free software community. It is very strict in the practice of free software, because our official infrastructure is all based on free software. In order to train neural networks in a reasonable time, we need something like a GPU, and the GPU requires a non-free driver and non-free firmware. It will be a problem if the Debian community wants to reproduce neural networks in our own infrastructure. If we cannot do that, then any deep learning application integrated in Debian itself is not self-contained. This piece of software cannot be reproduced by Debian itself. This is a real problem.
[00:22:32] SF: I totally understand it. I mean, for me, Debian has always been the lighthouse that you look to if you want to know whether a package is really giving users the freedom to run, modify, copy and distribute.
[00:22:47] MZ: Yeah. Debian is very strict in this regard.
[00:22:51] SF: Right. You’re basically adding a new element of restrictions on what a fully open-source AI system can be. You’re putting hardware in as a piece of this, because it’s fine up to a point: you have the data set, you have the training model and parameters and all that stuff. But if you have all the source code and you still cannot retrain your system to fix a bug, unless you have 10 years to wait, then you have a problem. What are the efforts to try to overcome this issue with the hardware drivers?
[00:23:31] MZ: Yeah, this is a tough topic for the open-source community. A lot of effort has been put into Nvidia driver reverse engineering. Nowadays, the free driver for Nvidia GPUs is still not available for CUDA computation. CUDA is what we need for training neural networks.
[00:23:57] SF: Also, the recent announcement from Nvidia still does not really help on the AI training front.
[00:24:04] MZ: That just helps a little bit. Nvidia has lots of software. The open-source driver is only a tiny bit of the whole ecosystem.
[00:24:17] SF: Okay, and how about other hardware manufacturers, like the newly announced chipsets from Google, from Apple? They seem to mention the fact that they have some AI capabilities, some AI instructions in there. What do you think of those?
[00:24:33] MZ: There are lots of new hardware manufacturers. You mentioned Google, right? They have their own Tensor Processing Unit. Currently, I don’t see any such TPU available on the market. Personally, we cannot buy it. There is no way for individual free software developers to look into such a thing. You also mentioned Apple. Yeah, they have done very good advertising on their new chips, but their corresponding ecosystems are not free. This is also a tough issue if you want to port your free software onto these platforms.
Basically, I think the big companies are responsible for doing this and there is no way for individual developers to do it. Apart from Apple, there are also AMD and Intel. The two manufacturers are releasing open-source computing software in order to compete with Nvidia. Currently, Nvidia’s CUDA computing software is dominant in this market. AMD has released their [inaudible 00:25:48] as a competitor. Recently, Intel also came up with oneAPI to compete with Nvidia. Nowadays, only Nvidia is providing a proprietary software solution for deep learning.
I think there is still a very long way to go for AMD and Intel, because Nvidia’s product is very mature at the current stage. ROCm and Intel’s oneAPI are still very new. The market still needs some time to verify their new products to see whether they work or not.
[00:26:29] SF: Right, right, right. Oh, it happened in the past that smaller architectures that were more open eventually took over, just with the work of large groups like Debian and others in the open-source world. Starting to think about the future, what does the future look like to you? What would you like to see inside Debian, in an ideal scenario?
[00:26:51] MZ: I have to say, my opinion is a little bit pessimistic, because there are various obstacles if we want to do some hardware support, or data center support. The two factors just require lots of money. That is difficult even for big companies. What I am expecting in the free software community is that we can continue to provide a solid system for production and for research. We can support these applications and deep neural network frameworks. We can do this very well. But if our users want to train neural networks, they may have to rely on external software, such as some random code downloaded from GitHub, or something like –
[00:27:47] SF: What kind of licensing schemes are more popular in the AI research community?
[00:27:54] MZ: Well, based on my own experience, the most popular license among this research community is Apache 2. Some of them are BSD-style licenses, or MIT-style licenses. Well, these licenses are very popular among the research community. If you are interested in some research paper and you find the corresponding code, the code is basically open source. The problem still stems from the corresponding training data, because many useful data sets are not free. You’ve got free software training code and inference code, but the data is not free.
[00:28:39] SF: Yeah, so we go back to the fact that there is no clear understanding, and the copyleft concept is not applied, or is not common, among AI applications.
[00:28:53] MZ: Yeah. There is not a clear understanding on this issue. Many researchers just release their neural network, but what license should we give to the trained neural network? Basically, nobody can answer this. We know there is a problem if we don’t clearly state a license.
[00:29:17] SF: In an ideal world for you, what’s an open-source AI?
[00:29:21] MZ: Well, I’m a Debian developer, so I stick to Debian’s Free Software Guidelines. We pursue software freedom. When I get a free software AI application, I expect that I am able to download the training data set. I can study the training code and the inference code, I can reproduce the neural network, and I can also modify the neural network. That’s what I am expecting. I know this is very hard to achieve in the foreseeable future.
[00:30:00] SF: Yeah. Okay. It’s good to set the bar high and hope for the best. At some point, we’ll get there. That includes also getting free drivers in order to run training in a significantly shorter time. All right, Mo. It’s been a pleasure. It’s been a pleasure to talk to you. I think we have covered a lot of ground. You helped us understand what’s an open-source system. You helped us understand what an AI system is, what components we need to watch for, from the training data sets to the model itself, and the hardware required to run it. Thank you. Thank you very much. What are your plans for the future? What are you working on?
[00:30:41] MZ: I have not completely decided yet. I love doing research. I enjoy the research process, because by doing research, you are exploring the borderline of human knowledge. I really enjoy this process, as long as we can make some progress. Because you’re studying something that nobody knows. You are the first one on the earth to know that new knowledge. This is very exciting.
[00:31:10] SF: It is exciting. Thank you very much, Mo Zhou.
[00:31:13] MZ: Yeah. Thank you for your time.
[END OF INTERVIEW]
[00:31:16] SF: Thanks for listening. Thanks to our sponsor, Google. Remember to subscribe on your podcast player for more episodes. Please review and share. It helps more people find us. Visit deepdive.opensource.org, where you find more episodes, learn about these issues, and you can donate to become a member. Members are the only reason we can do this work. If you have any feedback on this episode, or on Deep Dive AI in general, please email firstname.lastname@example.org.
This podcast was produced by the Open Source Initiative, with help from Nicole Martinelli. Music by Jason Shaw of audionautix.com, under a Creative Commons Attribution 4.0 International license. Links in the episode notes.
[00:31:59] ANNOUNCER: The views expressed in this podcast are the personal views of the speakers and are not the views of their employers, the organizations they are affiliated with, their clients or their customers. The information provided is not legal advice. No sponsor had any right or opportunity to approve or disapprove the content of this podcast.