PhD Research

My research focuses on integrating robotic perception and action. I believe that by predicting the perceptual consequences of actions based on memory and observation, a robot can solve novel tasks in a human environment. However, action and perception are often represented separately: as object models or maps in computer vision systems, and as action templates in robot controllers. Because of this separation, classification is often based on labels rather than on how the robot can interact with the environment, which limits how learned actions can be generalized.

The goal of my research is to construct an integrated model of action and perception for interacting with the environment and generalizing past experience to novel situations. In the sections below, I describe the aspect transition graph model and how it can be combined with belief space planning, visual servoing, error detection, hierarchical CNN features, object manipulation, and learning from demonstration.

Aspect Transition Graph Model

In computer vision, there are two common types of object models used for identification: one represents objects in 2D and the other in 3D. However, neither incorporates information about how perceptions of these objects change in response to actions. A robot that recognizes objects with traditional models knows nothing more than the object's label. Humans clearly have a different kind of object understanding: they can often predict the state and appearance of an object after an action. Incorporating actions into object models allows robots to interact with objects and predict action outcomes.

Instead of an independent object recognition system, I proposed an integrated model based on the aspect transition graph (ATG) representation that fuses information acquired from sensors and robot actions to achieve better recognition and understanding of the environment. An ATG summarizes how actions change the viewpoint or the state of an object, and thus the post-action observation.
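As a rough sketch of this structure, an ATG can be viewed as a directed multigraph whose nodes are aspects (distinct observations) and whose edges are actions annotated with how often each outcome was observed. The class and method names below are illustrative, not the implementation used in my work.

```python
from collections import defaultdict

class AspectTransitionGraph:
    """Minimal sketch of an ATG: nodes are aspects (distinct observations);
    directed edges are actions annotated with empirical outcome counts."""

    def __init__(self):
        # counts[(aspect, action)][next_aspect] -> number of observed transitions
        self.counts = defaultdict(lambda: defaultdict(int))

    def record_transition(self, aspect, action, next_aspect):
        """Update the model after executing `action` at `aspect` and observing `next_aspect`."""
        self.counts[(aspect, action)][next_aspect] += 1

    def transition_prob(self, aspect, action, next_aspect):
        """Empirical probability of reaching `next_aspect` from `aspect` via `action`."""
        outcomes = self.counts[(aspect, action)]
        total = sum(outcomes.values())
        return outcomes[next_aspect] / total if total else 0.0
```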

Belief Space Planning with ATGs

An intelligent agent must reason about its own skills and about the relationship between these skills and its goals under run-time conditions. This requires the agent to represent knowledge about its interactions with the world in a manner that supports reasoning. Since the early 1970s, the AI and robotics communities have been concerned with the design of efficient representations that support modeling and reasoning. However, most of these representations tackle only one part of the problem, making either modeling or reasoning easier.

I address these dual problems of modeling and reasoning with the aspect transition graph model, which is grounded in the robot's own actions and perceptions. The description of state is domain general, since it is computed directly from the status of executable actions rather than hand-built for a specific task. I present a planner that exploits this uniform description of state and the probabilistic models to plan efficiently in partially observed environments.
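As a hedged illustration of planning over such a model, the sketch below performs one Bayes-filter step over ATG aspect nodes, reusing the empirical transition model from the previous sketch; the `obs_likelihood` observation model is a placeholder, not a component from my implementation.

```python
def update_belief(belief, action, observation, atg, obs_likelihood):
    """One Bayes-filter step over ATG aspect nodes: predict with the action
    model, then weight each aspect by how well it explains the observation."""
    predicted = {}
    for aspect, prob in belief.items():
        for next_aspect in list(atg.counts[(aspect, action)]):
            predicted[next_aspect] = predicted.get(next_aspect, 0.0) + \
                prob * atg.transition_prob(aspect, action, next_aspect)

    # obs_likelihood(observation, aspect) is an illustrative placeholder
    posterior = {a: p * obs_likelihood(observation, a) for a, p in predicted.items()}
    total = sum(posterior.values())
    return {a: p / total for a, p in posterior.items()} if total else predicted
```

A planner can then evaluate candidate action sequences by how much they are expected to concentrate this belief on aspects that satisfy the goal.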

Object Manipulation with Visual Servoing

Designing robots that can model unstructured environments and act predictably in them is a challenging problem. To make actions repeatable and predictable in such environments, I introduced a novel image-based visual servoing algorithm that works in conjunction with the aspect transition graph model and demonstrated the approach on a tool-grasping task.

In prior work, a controller is described as a funnel that guides the robot state to convergence. However, in some situations the goal state may not be reachable through a combination of controllers that act like funnels. For example, the visual servoing controller drives the end effector to a desired pose based on the visual appearance of the robot hand. To reach the goal state, however, the robot may first need a controller that transitions from a state in which the hand is not visible to one in which the visual servoing controller can be executed. Such a controller can be an open-loop motion to a memorized pose, and it does not necessarily converge to a particular state the way a funnel does.

I introduce the notion of a slide as a metaphor for this kind of action that transitions from one set of states to another. State uncertainty may increase after transitioning down a slide, but the robot can still reach the goal state if the funnel-slide-funnel structure is carefully designed. I investigate how a sequence of these two kinds of controllers changes how an object is observed.
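The sketch below illustrates the funnel-slide-funnel composition under assumed interfaces: a funnel is a closed-loop controller whose `step(state)` call returns a scalar error, and a slide is an open-loop motion to a memorized pose. It is a schematic of the idea, not the controllers used on the robot.

```python
def run_funnel(controller, get_state, tolerance=1e-3, max_steps=500):
    """Closed-loop 'funnel': iterate the controller until its error converges."""
    for _ in range(max_steps):
        if controller.step(get_state()) < tolerance:
            return True
    return False

def funnel_slide_funnel(pre_funnel, slide_to_memorized_pose, post_funnel, get_state):
    """Funnel into the slide's start region, execute the open-loop slide, then
    let the final funnel (e.g. visual servoing) absorb the added uncertainty."""
    if not run_funnel(pre_funnel, get_state):
        return False
    slide_to_memorized_pose()            # open-loop: no convergence guarantee
    return run_funnel(post_funnel, get_state)
```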

Error Detection and Surprise Handling

For an autonomous robot to accomplish tasks when the outcomes of actions are non-deterministic, it is often necessary to detect and correct errors. I introduce a general framework that stores fine-grained event transitions so that failures can be detected and handled early in a task. Failures are then recovered through two different approaches, depending on whether the error is “surprising” to the robot. Surprise is determined by comparing the information content of an observation with the entropy of the expected outcome distribution. Surprising transitions are then used to create new models that capture observations not previously in the model.
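One way to read this test, sketched below, is to flag an outcome whose self-information exceeds the entropy of the predicted outcome distribution; the exact criterion here is my illustrative reading of that comparison rather than the thresholds used in the implementation.

```python
import math

def is_surprising(observation_prob, predicted_outcome_probs):
    """Flag an observation as 'surprising' when its self-information exceeds
    the entropy of the predicted outcome distribution (illustrative criterion)."""
    information = -math.log(observation_prob)                       # -log P(o)
    entropy = -sum(p * math.log(p) for p in predicted_outcome_probs if p > 0.0)
    return information > entropy
```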

High-level manipulation actions can be represented as a sequence of primitive actions and aspect transitions. For example, the “flip” macro action on the uBot-6 mobile manipulator is implemented as a sequential composition of four primitive actions: 1) reach, 2) grasp, 3) lift, and 4) place. These four actions connect five fine-grained aspect nodes that represent expectations for intermediate observations, allowing the robot to monitor and detect unplanned transitions at many intermediate stages of the flip interaction.
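A minimal sketch of this monitoring, assuming hypothetical `execute` and `observe` callbacks to the robot, might look as follows.

```python
FLIP_PRIMITIVES = ["reach", "grasp", "lift", "place"]   # uBot-6 flip macro action

def execute_monitored_macro(primitives, expected_aspects, execute, observe):
    """Run a macro action while checking each intermediate observation against
    the expected fine-grained aspect node; report where a failure occurred."""
    assert len(expected_aspects) == len(primitives) + 1
    if observe() != expected_aspects[0]:
        return False, 0                        # unexpected starting aspect
    for i, primitive in enumerate(primitives):
        execute(primitive)
        if observe() != expected_aspects[i + 1]:
            return False, i + 1                # unplanned transition caught early
    return True, len(primitives)
```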

Grasping with Hierarchical CNN Features

I introduced a solution for posturing the anthropomorphic Robonaut-2 hand and arm for grasping based on visual information. A mapping from visual features extracted from a convolutional neural network (CNN) to grasp points is learned. I demonstrate that a CNN pre-trained for image classification can be applied to a grasping task using only a small set of grasping examples. The approach takes advantage of the hierarchical nature of the CNN by identifying features that capture the hierarchical support relations between filters in different CNN layers and locating their 3D positions by tracing activations backwards through the network. When this backward trace terminates in the RGB-D image, it localizes important manipulable structures. I call these features hierarchical CNN features.
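The sketch below illustrates the backward trace in a simplified form, assuming precomputed per-layer activation maps and a chosen filter path; the neighborhood masking is a crude stand-in for restricting the search to the parent activation's receptive field, and all names are illustrative.

```python
import numpy as np

def backtrace_feature(activations, filter_path, window=2):
    """Trace a hierarchical CNN feature back toward the image (simplified).

    `activations` lists one (num_filters, H, W) response array per layer,
    ordered from the lowest layer upward; `filter_path` gives the filter index
    to follow at each layer. Starting from the strongest top-layer activation,
    each lower layer keeps its strongest response near the location found in
    the layer above it."""
    location, prev_shape = None, None
    for response_map, filt in zip(reversed(activations), reversed(filter_path)):
        response = np.asarray(response_map[filt], dtype=float)
        if location is not None:
            # map the parent location to this layer's (possibly finer) resolution
            r = int(location[0] * response.shape[0] / prev_shape[0])
            c = int(location[1] * response.shape[1] / prev_shape[1])
            mask = np.zeros_like(response)
            mask[max(r - window, 0):r + window + 1,
                 max(c - window, 0):c + window + 1] = 1.0
            response = response * mask
        location = np.unravel_index(np.argmax(response), response.shape)
        prev_shape = response.shape
    return location  # (row, col) in the lowest layer, near image resolution
```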

Hierarchical CNN features that reside in different layers of the CNN are then associated with controllers that engage different kinematic subchains in the hand/arm system for grasping. A grasping dataset is collected from demonstrated hand/object relationships on Robonaut-2 to evaluate the proposed approach in terms of the precision of the resulting preshape postures. I demonstrated that this approach outperforms baseline approaches in cluttered scenarios on the grasping dataset and outperforms a point-cloud-based approach on a grasping task using Robonaut-2.

Aspect Representation for Object Manipulation

In this work, I present an intelligent visuomotor system that interacts with the environment and memorizes the consequences of actions. As more memories are recorded and more interactions are observed, the agent becomes more capable of predicting the consequences of actions and is thus better at planning sequences of actions to solve tasks. I propose a novel aspect representation based on hierarchical CNN features that supports manipulation and captures the essential affordances of an object from RGB-D images. In a traditional planning system, robots are given a pre-defined set of actions that take the robot from one symbolic state to another. However, symbolic states often lack the flexibility to generalize across similar situations. The proposed representation is grounded in the robot's observations and lies in a continuous space, which allows the robot to handle similar unseen situations. The hierarchical CNN features within a representation also allow the robot to act precisely with respect to the spatial location of individual features.
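As a simplified illustration of operating in this continuous space, the sketch below matches a new observation to the most similar stored aspect by feature distance; the memory structure and the Euclidean metric are assumptions for the example, not the exact mechanism used on the robot.

```python
import numpy as np

def most_similar_aspect(query_features, aspect_memory):
    """Return the stored aspect whose feature vector is closest to the query.

    `aspect_memory` maps aspect ids to feature vectors (e.g. stacked
    hierarchical CNN feature responses); names are illustrative."""
    query = np.asarray(query_features, dtype=float)
    best_id, best_dist = None, float("inf")
    for aspect_id, features in aspect_memory.items():
        dist = float(np.linalg.norm(query - np.asarray(features, dtype=float)))
        if dist < best_dist:
            best_id, best_dist = aspect_id, dist
    return best_id, best_dist
```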

I evaluate the robustness of this representation on the Washington RGB-D Objects Dataset and show that it achieves state-of-the-art results for instance pose estimation. The representation is then tested in conjunction with an ATG on a drill-grasping task on Robonaut-2. I show that, given grasp, drag, and turn demonstrations on the drill, the robot is capable of planning sequences of learned actions to compensate for reachability constraints.

Learning From Demonstration

Learning from demonstration (LfD) is an appealing approach to programming robots because of its similarity to how humans teach each other, and it has been shown to allow non-experts to program robots on simple tasks. However, most work on LfD focuses on learning the demonstrated motion, action constraints, or trajectory segments, and assumes that object labels and poses can be identified correctly. This assumption may hold in certain industrial environments where fiducial markers can be used to identify and localize objects precisely, but it often does not hold in everyday human environments.

I propose a more integrated approach that treats visual features as part of the learning process. This not only gives the robot the ability to manipulate objects without fiducial markers but also allows it to learn actions that depend only on certain parts of an object, such as a switch on a panel. Instead of defining actions as relative movements with respect to the object pose, the learned actions are based on visual features that represent meaningful object parts. For example, when learning to grasp a drill by the handle, it may be beneficial to place the fingers based on features that represent the handle rather than features that represent other parts of the drill, such as the drill tip.


In this work, I show that a challenging bolt-tightening task using a ratchet can be learned from as few as three demonstrations. The difficulty lies in the uncertainty of the in-hand ratchet pose after grasping and the high precision required to place the socket on top of the bolt. The proposed approach learns the desired relative pose between the socket of the ratchet and the bolt before tightening by identifying consistent feature offsets among a small set of demonstrated examples. The approach is not limited to bolt tightening and can learn tool-use tasks that require a tool to be moved relative to an object. I show that Robonaut-2 is capable of grasping the ratchet, tightening a bolt, and returning the ratchet to a tool holder given three demonstrations.
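A minimal sketch of the offset-consistency idea, with assumed data structures and an illustrative threshold, is shown below.

```python
import numpy as np

def consistent_offsets(feature_positions, target_positions, max_std=0.01):
    """Keep features whose offset to the target (e.g. the bolt) is consistent
    across demonstrations, and average those offsets.

    `feature_positions[d][f]` is the 3D position of feature f in demonstration d,
    `target_positions[d]` is the target's 3D position; the names and the 1 cm
    consistency threshold are illustrative."""
    offsets = {}
    for feature in feature_positions[0]:
        per_demo = np.array([np.asarray(demo[feature]) - np.asarray(target)
                             for demo, target in zip(feature_positions,
                                                     target_positions)])
        if per_demo.std(axis=0).max() < max_std:        # consistent across demos
            offsets[feature] = per_demo.mean(axis=0)    # learned relative placement
    return offsets
```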