Columbia Engineering Students Win 1st Place in NeurIPS Minecraft Challenge

Team leader David Watkins discusses how playing computer games can give roboticists real-world insights

Mar 10 2022 | By Bernadette Ocampo Young
David Watkins

When Minecraft first debuted in 2011, Columbia engineer David Watkins was one of its beta users. Little did he know that a decade later, he would combine that experience with his own academic research to claim first place at a prestigious artificial intelligence conference.

Minecraft, a multi-platform “sandbox” video game in which players explore a three-dimensional, other-worldly environment, is one of the world’s most popular games with over 100 million players monthly. As such, organizers of the third annual BASALT: A MineRL Competition on Solving Human-Judged Tasks created an opportunity to use the game to develop AI tools.

Minecraft’s long-lasting popularity is partially due to the fact that there are no set rules to obey or goals to achieve. This allows players a lot of freedom: they can craft new objects and mine resources as often as they want.

For the 2021 BASALT competition, officials codified a goal: teach a Minecraft robot agent to accomplish four tasks (find a cave, make a waterfall, build an animal pen, and build a house) while a group of judges evaluates the quality of the agent's performance. Participating teams had to use video demonstrations of people playing Minecraft to train their AI to perform those same tasks in new environments. Judges evaluated how effectively those teams devised new strategies for leveraging raw human demonstration data, an ongoing challenge for real-world AI experts.

Watkins’ team took home a $5,000 first-place prize, as well as an invitation to present their solution at NeurIPS. They were also awarded an additional $500 for developing the most human-like solution.

Watkins, who is advised by Professor Emeritus Peter Allen, is a PhD student in the Columbia Robotics Lab and a fellow at the Army Research Lab. For three months, he worked with colleagues there and at the University of Maryland, Baltimore County (Vinicius Goecks, Nicholas Waytowich, and Bharat Prakash) to come up with a system that garnered top honors in both overall performance and most human-like agent. Columbia Engineering caught up with Watkins to talk about his winning system.

Q: For most people, Minecraft is a fun game to play. How did you turn it into a fun research project?
Something that can be challenging as a researcher is finding real-world analogues to everyday environments. Minecraft in many ways shares the open-ended nature of problems we want to tackle as roboticists.

The competition was both fun and hard work. Seeing the robot agent autonomously and successfully move around in the game to complete an arbitrary task was such a rewarding moment. It is also great to participate in a competition to encourage the research community to build new deep learning procedures with challenging data.

Q: What was the challenge you set up for yourselves? How did you meet it?
When we started this competition, we set a guideline for ourselves: approach it the same way we would if we were trying to have a robot act like a human being. The most logical approach was to create a hierarchical system that used an AI component, a state classifier, to gain an intuition about its surroundings and then employ a series of strategies based on that understanding. This is referred to as “human-in-the-loop learning.” Most researchers avoid this type of learning because collecting human demonstrations is difficult and time-consuming. In this case, we had the good fortune of a labeled dataset provided by the competition organizers.

For example, one of the competition subtasks was to place a waterfall and take a picture of it. To do this, our AI had to know the agent's position in the gaming environment at all times—something that is both crucial and complicated for AI to pull off. We humans take for granted how easy it is for us to localize ourselves in our environments using just our vision, but robots need to be trained to have a similar skillset.

To accomplish this, we designed a state classifier, which in our case classified image content by whether it has a mountain in it or not. If a mountain has been found, the agent goes towards it. Upon reaching the top of the mountain, the agent then creates a waterfall and moves a fixed distance away to take the picture. Getting the agent to be human-like during this step was particularly challenging.

The “Make Waterfall” State Machine diagram
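
The waterfall strategy Watkins describes can be read as a small state machine driven by the classifier's output. The Python sketch below is only an illustration of that idea, not the team's code: the state names, the `BACK_AWAY_STEPS` value, the fallback from approaching to searching, and the `sees_mountain` / `at_summit` inputs (standing in for the classifier and the agent's own checks) are all assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    SEARCH = auto()       # wander until the classifier reports a mountain
    APPROACH = auto()     # walk toward the mountain
    PLACE_WATER = auto()  # create the waterfall at the top
    BACK_AWAY = auto()    # move a fixed distance from the waterfall
    SNAPSHOT = auto()     # take the picture
    DONE = auto()


BACK_AWAY_STEPS = 3  # hypothetical fixed distance, counted in steps


def next_state(state, sees_mountain, at_summit, steps_backed):
    """Return the next state from the current state and observations."""
    if state is State.SEARCH:
        return State.APPROACH if sees_mountain else State.SEARCH
    if state is State.APPROACH:
        if at_summit:
            return State.PLACE_WATER
        # Assumption: losing sight of the mountain sends the agent back to searching.
        return State.APPROACH if sees_mountain else State.SEARCH
    if state is State.PLACE_WATER:
        return State.BACK_AWAY
    if state is State.BACK_AWAY:
        if steps_backed >= BACK_AWAY_STEPS:
            return State.SNAPSHOT
        return State.BACK_AWAY
    return State.DONE  # SNAPSHOT and DONE both end the episode


def run(observations):
    """Step through (sees_mountain, at_summit) pairs; return the visited states."""
    state, steps_backed = State.SEARCH, 0
    trace = [state]
    for sees_mountain, at_summit in observations:
        if state is State.DONE:
            break
        if state is State.BACK_AWAY:
            steps_backed += 1
        state = next_state(state, sees_mountain, at_summit, steps_backed)
        trace.append(state)
    return trace
```

Calling `run` with a sequence of simulated observations returns the list of states the agent passes through, which makes it easy to check each transition without a game environment. In a real agent, `sees_mountain` would come from the trained image classifier on each frame.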

Q: How did your expertise put your team over the top?
The biggest advantage we had was adhering to realism as closely as possible. To train the state classifier, we hand-labeled the trajectories in the gameplay videos provided by the competition. We went through over 200 videos and made 88,000 hand-labeled annotations. Hand-labeling data takes time, but the easier it is to annotate data, the better your AI will perform. Other teams in the competition tried to avoid using human-generated data as much as possible, whereas we embraced the usefulness of the dataset in our solution. Our team has lots of experience using human-generated data to train neural networks, and I believe our expertise in human-in-the-loop learning helped us win the competition.

Q: How did your work in the competition relate to your real-life research projects?
This competition was similar to research I have previously done in robotic navigation, because most of the competition tasks focus on moving the agent around without collisions and the agent did not have a global position at runtime. This work is also similar to human-in-the-loop learning that my team at the Army Research Lab has previously worked on.

Q: What are your takeaways from the experience?
We all learned a lot about building good software for labeling data. The faster you can acquire new training data, the better your output is going to be. This work will be utilized in future human behavior modeling research for training robots to operate like a human player. Similar methodologies are being used in deep learning research on games such as StarCraft II and Dota 2.

This research isn’t only useful for gaming, however. Now that we have a good pipeline for capturing human data in a virtual environment, we could map any navigation task using visual information onto a real robot.