Excerpts from the fireside chat by Gerard Medioni with Michael Black at the WACV 2024 conference in Hawaii on January 6th, 2024.

On my way to the WACV24 conference to present my research paper on Human Pose Correction using Computer Vision, I had a layover at the Chicago airport. At the gate, I entered Hudson Nonstop, grabbed a pretzel pack, and walked out without any checkout or contact. It turns out that the store is one of the 150+ stores worldwide powered by Amazon’s Just Walk Out technology.

Little did I know that I would end up at a fireside chat at the conference where Gerard Medioni, Professor Emeritus of Computer Science at USC, who led the team behind the Just Walk Out technology at Amazon Science, addressed the researchers. Renowned as a visionary in the field of computer vision, his audacious transition to Amazon demonstrated to the global community how profound academic insights could be successfully converted into tangible, real-world innovations.

The fireside chat (or rather, beachside chat) at WACV24 in Waikoloa, Hawaii, flipped the oral and poster sessions of this computer vision conference on their head and made researchers think deeply about long-term problems and innovations that would end up impacting the real world.

Moderator: Michael Black, Max Planck Institute for Intelligent Systems

Michael: How do you get so many citations?
Gerard: You’re dropped somewhere, not knowing where you are. We wrote a paper on localization. Reviewers asked, “Why do you do this when we have GPS?” 20 years later, someone asked where a particular picture was taken. My citations zoomed up, and people started calling me on their side.

Michael: How do you pick problems?
Gerard: Incremental versus durable. Incremental ones give you 2%. In 6 months, it gets irrelevant. Fundamental problems like optical flow will be there. Your paper gets cited a lot when it addresses a fundamental problem that can have multiple use cases.

Michael: How do you adapt and stay relevant when the field is constantly changing?
Gerard: Look for papers that have fundamental changes, similar to regularization. One paper is about snakes. Step back and see whether it’s going to be fundamental or not.

Michael: What prompted you to go into academia after your Ph.D.?
Gerard: Opportunity to continue research on how to turn pixels into semantic entities. I interviewed with defense. I wasn’t a US citizen, and I couldn’t see US data. JPL gave me an offer six months after I signed up for faculty. You can do a lot of planning, but be ready to deal with serendipity. In your career, there is no A/B testing. You get to a fork, and you have to make a decision.

Michael: At USC, you have quite a few startups and consult for them?
Gerard: That was my parallel career. I’m not an ivory tower professor where I invent a problem and solve it. Instead, I always looked for problems that give me value. In the industry, if people couldn’t solve a problem, it wasn’t because they were not good engineers, but they didn’t have scientific rigor.

CMYK is used to print an image. An engineer called me and said it took 10 hours to print. He asked me to review his solution to shorten it. I told him the truth: that it wouldn’t work, and that I knew how to make it work. After three months, he came back, and it led to four papers. Halftone registration started in the vision community then.

Michael: When you left USC and joined Amazon, from a beautiful place to a rainy place, how did you come to that?
Gerard: I was building a system for navigation for the blind. We had micro motors that vibrate left to turn left and vibrate together to stop, else you hit something. Patients loved it, and I wrote papers. I got an email from a person in the Midwest asking for its deployment. I got utterly depressed as it was not the next step for me; someone else must take my paper to make it for the world’s consumers. I wasn’t affecting the world, really, and then I got the call from Amazon. They told me to work for them without giving any details. I went there to give a talk. They told me to interview, and I said no. I declined their homework but ended up doing it. The VP told me to walk into a store, take whatever you like, and walk down. I told them this is crazy. I told them I had to hire a bunch of people, I told them I have to open an office elsewhere. I thought they were crazy. My wife told me I might regret it later if I didn’t sign up. Now it is a litmus test. I must now show the world that CV can solve world problems.

I joined in June, and by December, we had a team. When you see the mountain from afar, you don’t know its height, but as you get closer, the mountain gets higher. When we finally opened the store, it was a mega success. Amazon gave me the opportunity to really create products that affect millions of people, and it is a different type of reward.

Michael: You did Amazon One and palm scanning. Shipping a product versus writing a paper. It sounds like you prefer a product.

Gerard: The product requires much more than research. When you get an engine, you need an army of engineers who can make a car. That is what industry lets you do. Day to day, I keep hiring post-Ph.D. students, guide them, and they guide you too. When we say no to the graduate students, they prove you wrong. The way to work on industry research is we write papers and also write a patent. We have to describe in great detail what it is, and a bunch of lawyers turn it into something that no one else understands.

Michael: OpenAI and others showed scale really matters. Large data and large teams too. What does it mean for academia?
Gerard: Competing with labs that have hundreds of people working on the next LLM, which requires billions of data points, is not a good idea. There are many islands that are not necessarily at the forefront of what industry is doing. There are many problems that don’t have immediate economic value.

Michael: We are on a path to a perception machine, and it is a matter of giving it a new task?
Gerard: No. LLMs are amazing. However, there is no path to AGI. We can see their limitations daily. Large language data won’t go to human level. We have a lot of room for symbolic reasoning too. I was surprised myself by the impact of deep learning. In 2009, someone asked me to detect the birds in a tree. I said no. Three years later, the same person told me I had lied to him. This is why predicting is very hard. The first inflection point was deep learning. We used Caffe, but it was hard. Now we have LLMs. These things change the way we interact with computers.

Michael: You have played a pivotal role in these conferences. This is evident in the prize you received in 2019. Do you see a way the new generation can get involved?
Gerard: The first one I organized was in 1991 in Hawaii. I was an assistant professor then. I was both program chair and general chair. Now, we need to bring in volunteers as no one gets paid. First, you become a reviewer, then an area chair, and then a general chair. As you think about your career, think about how you can volunteer. We have many conferences, and all of them need volunteers. This is a call for volunteers.

Michael: Thank you for CVPR 1991 in Hawaii. I couldn’t believe my advisor paid for me to go to Hawaii! You always seem to be enjoying yourself. Any tips to make conferences enjoyable?
Gerard: I try to put conferences in good places as there is more to the conference than the conference itself. Learn the geography, network, and make connections. These are your colleagues; probably some of them work on the same problem. Some of them are forming affinities. Look at it as an experience. It is not just about presenting your paper.

Michael: You have your family here. Any work-life balance tips?
Gerard: Balance is key.

Michael: Have you considered quitting computer vision at any point? You get rejections from reviewers more often than not as an author.
Gerard: NSF has humbled me many times with rejections. The first impulse after rejection is to call the program chair. Rejection means three people rejected you. If your paper is not just incremental, submit it again.

Michael: As a new student of computer vision, I see the field is changing too quickly. What do I focus on?
Gerard: Focus on problems that have a long-term impact. Go to a new beach, not a crowded one. An example is the creation of a new dataset that I give my baseline to. You plant the pole at that new beach. Don’t just do incremental value. Don’t look at the next conference. At the end of your Ph.D., is your thesis relevant? You should ask this.

Michael: Papers of incremental performance are most welcome by the reviewers. What should I do?
Gerard: I don’t mean to say never do incremental. Some incremental work has value as it demonstrates that you understand the field. However, keep in mind, is there a problem that I can keep working on? As a grad student, if you hit a wall, the wall isn’t going to move. Always work on two things at the same time. The wall may move by you moving.

Michael: Skill set shift from academia to industry? Insights?
Gerard: In academia, the worst thing is you miss a publication deadline or proposal submission. In industry, when you miss a deadline for product delivery, the pressure is very different. A friend of mine said, “In academia, I underused curse words, but now in product development, I use them all the time.”

Michael: Does the number of CVPR papers correlate to “get things done” in industry?
Gerard: When we hire Ph.D. graduates, we look at the publication pattern, and CVPR and ICCV stand very high. They indicate your ability to express your ideas and present them properly, which is super important in the community to earn trust with your peers.

Michael: When would you say computer vision is a solved problem?
Gerard: It is a solved problem when we obtain AGI. I don’t think computer vision in isolation is solved unless you solve the rest of the intelligence puzzle.

Michael: We split the single vision community into vision, robotics, and many other parts. With AGI, are we going to be one happy family?
Gerard: What used to be analog for different things turned out to be digital on the phone now. With LLMs, text and speech are similar. In language, tokens are clear. Vision is not the case. Solving language comes first before vision gets solved. We need a set of tools before convergence happens.

It was a spellbinding session by Gerard, with great positioning by Michael. Back in India, I was fortunate to head a National AI Technology Hub at IIT Kharagpur, which is part of a larger National Mission anchored on Translational Research. While the government is doing its part, India’s top industries must tap the top scientists of the IITs to solve the real-world problems of India, just as Amazon did with Gerard to build Just Walk Out technology.
