Pose and Action Recognition (ukita)

人の姿勢・行動認識

カメラで人を撮影した画像には，姿勢や行動などの人の物理的な状態を表す様々な情報が含まれる．中でも，人の姿勢は人間の振る舞いや身体能力を表す基本情報であり，画像認識によりその推定が可能になれば，その推定が可能になれば，多種多様な応用システムが可能になる．本研究では，デジカメやデジタルビデオで誰でも撮影できるような一般的な画像中から，人体の姿勢を推定し，さらにその推定姿勢に基づいてその人間の行動を識別することを目標とする．

過去・現在の具体的な研究テーマを以下に示す．

識別学習された局所方位輪郭特徴量によりパーツ連結性を評価した人体姿勢推定
人体パーツと背景を２値分割した領域特徴量による人体姿勢推定
人体姿勢推定における効率的な学習のための学習サンプル選択
Iterative Action and Pose Recognition using Global-and-Pose Features and Action-specific Models
Occluded Appearance Modeling with Sample Weighting for Human Pose Estimation
Ensemble Convolutional Neural Networks for Pose Estimation
Semi- and Weakly-supervised Human Pose Estimation

・識別学習された局所方位輪郭特徴量によりパーツ連結性を評価した人体姿勢推定

This paper proposes contour-based features for articulated pose estimation. in single images. Most of recent methods, including general object recognition using graphical models, are designed using tree-structured models with appearance evaluation only within the region of each part. While these models allow us to speed up global optimization in localizing the whole parts, useful appearance cues between neighboring parts are missing. Our work focuses on how to evaluate parts connectivity using contour cues (Fig. 1). Unlike previous works, we locally evaluate parts connectivity only along the orientation between neighboring parts within where they overlap. This adaptive localization of the features is required for suppressing bad effects due to nuisance edges such as those of background clutter and clothing textures, as well as for reducing computational cost. Discriminative training of the contour features improves estimation accuracy more. Experimental results with a public dataset of people images verify the effectiveness of our contour-based features (Fig. 2).

**図１：** （左）ベース手法の結果．左腕の推定に失敗．（中央）輪郭画像．提案手法では，胴体から左腕に繋がる輪郭を評価する．（右）提案手法の結果．左腕の推定に成功．

**図２：** 提案手法による姿勢推定結果（左，中央）．ベース手法の結果（右）．
画像下の数字は，推定に成功した人体パーツの割合を示す．

・人体パーツと背景を２値分割した領域特徴量による人体姿勢推定

本研究では，複雑背景下で撮影された一枚の静止画像中から人体の姿勢を推定する手法を提案する．この問題では，画像中の各ウィンドウにおける人体の各パーツ(頭，胴，腕など) らしさ，すなわちウィンドウの見えとパーツの見えの類似度を正しく評価することが重要となる．ほとんどの従来手法では，濃度勾配ベースの特徴量のみを用いて各パーツの類似度を計算するため，個人差に依存しにくい類似度評価のために重要な人体輪郭だけでなく，服表面のテクスチャや複雑な背景模様も検出されてしまう問題があった．本研究ではこの問題を解決するために，パーツ類似度評価のための2 段階の処理を新たに提案する．まず第1 に，人体のパーツ領域をより正しく抽出するためのパーツ・背景領域分割法を提案する．この方法では各パーツの候補領域において，各パーツ形状の事前知識を参照したパーツ領域と背景領域を２値化をすることによって，人体輪郭だけを領域特徴量として抽出し，この領域特徴量に基づいて各パーツとの類似度を評価する．第2に，領域特徴量と従来の濃度勾配ベースの特徴量とを統合することにより，両特徴量を相補的に利用したパーツ類似度の評価法を提案する．実験の例を図３に示す．

**図３：** 提案手法による姿勢推定結果（左）．ベース手法の結果（右）．
画像下の数字は，推定に成功した人体パーツの割合を示す．

・人体姿勢推定における効率的な学習のための学習サンプル選択

認識のためのモデル学習では，サンプルが多いほど認識性能は向上するが，学習の計算コストはサンプル数に応じて増大してしまう．そこで，学習サンプルを適切に選択し，推定正答率を大きく落とすことのない学習の高速化を提案する．本研究では，人体姿勢推定を研究対象とする．提案法は，一般的な認識問題に適用可能なフレームワークに加え，人体姿勢特徴量に特化した低次元化による高速化も備える点で，従来の手法と異なる．本稿では，サンプルの選択法を３つ提案する．一つはクラスタリングによる選択で，重複的な学習を避ける方法である．二つめは識別境界からの距離による選択で，識別境界の効率的な更新に着目した手法である．三つめは，前述の２つの手法を組み合わせ，時間のかかるクラスタリングと誤検出探索において，人体姿勢推定問題に特化した特徴量の低次元化と枝刈りを加えた．これらの手法について実験を行い，実験の結果，統計的なパラメータに基づいた人体姿勢推定の正答率の低下を3%以下に抑えたまま学習時間を79%削減できた．提案手法による高速学習で適切なモデルを選択し，そのモデルを全サンプルで学習することによって，全体の学習時間の大幅削減と従来とおりの高認識率を両立できる（図４）．

**図４：** 提案法による学習時間削減．赤棒が全サンプルを学習して最高のモデルを得る計算時間．橙棒が，提案法により選択サンプルのみを学習して最高モデルを選び，最高モデルを全サンプルで学習する計算時間．

・Iterative Action and Pose Recognition using Global-and-Pose Features and Action-specific Models

This work proposes an iterative scheme between human action classification and pose estimation in still images (Fig. 5). For initial action classification, we employ global image features that represent a scene (e.g. people, background, and other objects), which can be extracted without any difficult human-region segmentation such as pose estimation. This classification gives us the probability estimates of possible actions in a query image. The probability estimates are used to evaluate the results of pose estimation using action-specific models. The estimated pose is then merged with the global features for action re-classification. This iterative scheme can mutually improve action classification and pose estimation, as illustrated in Fig. 6. Experimental results with a public dataset demonstrate the effectiveness of global features for initialization, action-specific models for pose estimation, and action classification with global and pose features.

**図５：** Pose representation with 10 body parts.

**図６：** Overview of the proposed method. Red, green, and black arrows depict the data flows of feature vectors, model parameters, and estimated values, respectively. The model parameters are employed with the features for action classification and pose estimation, which are performed in left and right dotted rectangles, respectively. Pose features, which are produced from estimated poses, are fed back to be merged with global image features (i.e. Object Bank feature) for iterative action classification. After iteration between action classification and pose estimation is finished, their final results are determined, as depicted by blue arrows.

・Occluded Appearance Modeling with Sample Weighting for Human Pose Estimation

This work proposes a method for human pose estimation in still images. The proposed method achieves self-occlusion-aware appearance modeling. Appearance modeling with less accurate appearance data is problematic because it adversely affects the entire training process. The proposed method evaluates the effectiveness of mitigating the influence of occluded body parts in training sample images. In order to improve occlusion evaluation by a discriminatively-trained model, occlusion images are synthesized and employed with non-occlusion images for discriminative modeling. The score of this discriminative model is used for weighting each sample in the training process (図７). Experimental results demonstrate that our approach improves the performance of human pose estimation in contrast to base models.

**図７：** Illustration of our proposed approach. From sample images with parts annotation, part windows are cropped and divided into non-occluded and occluded parts. These part windows are employed for learning the occlusion confidence model. This model weights each sample image in learning the pose estimation model. Processes enclosed by a dashed rectangle are executed for each body part.

・Ensemble Convolutional Neural Networks for Pose Estimation

Human pose estimation is a challenging task due to significant appearance variations. An ensemble of models, each of which is optimized for a limited variety of poses, is capable of modeling a large variety of human body configurations. However, ensembling models is not a straightforward task due to the complex interdependence among noisy and ambiguous pose estimation predictions acquired by each model. We propose to capture this complex interdependence using a convolutional neural network. Our network achieves this interdependence representation using a combination of deep convolution and deconvolution layers for robust and accurate pose estimation (図８). We evaluate the proposed ensemble model on publicly available datasets and show that our model compares favorably against baseline models and state-of-the-art methods.

**図８：** 上部：Training stage, 下部：Testing stage．

・Semi- and Weakly-supervised Human Pose Estimation

For human pose estimation in still images, this paper proposes three semi- and weakly-supervised learning schemes. While recent advances of convolutional neural networks improve human pose estimation using supervised training data, our focus is to explore the semi- and weakly-supervised schemes. Our proposed schemes initially learn conventional model(s) for pose estimation from a small amount of standard training images with human pose annotations. For the first semi-supervised learning scheme, this conventional pose model detects candidate poses in training images with no human annotation. From these candidate poses, only true-positives are selected by a classifier using a pose feature representing the configuration of all body parts. The accuracies of these candidate pose estimation and true-positive pose selection are improved by action labels provided to these images in our second and third learning schemes, which are semi- and weakly-supervised learning. While the first and second learning schemes select only poses that are similar to those in the supervised training data, the third scheme selects more true-positive poses that are significantly different from any supervised poses (図９). This pose selection is achieved by pose clustering using outlier pose detection with Dirichlet process mixtures and the Bayes factor. The proposed schemes are validated with large-scale human pose datasets.

**図９：** Overview of the proposed method. Images each of which has a pose annotation and an action label (i.e., Figure (a)) are used to acquire initial action-specific pose models (i.e., (b)). Each model is acquired from images along with its respective action label. a-th action-specific pose model is used for estimating human poses in training images with the label of a-th action (i.e., (c)). Each estimated pose (i.e., (d)) is evaluated whether or not it is true-positive. This evaluation is achieved by a pose feature representing the configuration of all body parts (i.e., (e)). True-positive poses are selected also by outlier detection using Dirichlet process mixtures (i.e., (f)). These true-positive poses are employed as pose annotations (i.e., (g)) for re-learning the action-specific pose models. After iterative re-learning, all training images with pose annotations (i.e., (a) and (g)) are used for learning a final pose model (i.e., (h)).