Technical Program

Program

Time

Sunday, September 21

19:00 Networking Reception at Novotel Hotel

Monday, September 22

08:50 Welcome Remarks
09:00 Keynote 1 by Ton Kalker: Multimedia Security: Technology and Society
10:00 Oral 1: 3D Video Analysis
11:00 Morning Break
11:30 Poster 1: Multimedia Processing Systems and Applications
12:30 Lunch & MMSP TC
14:00 Overview 1 by Nadia Thalmann: Autonomous Virtual Humans and Social Robots in Telepresence
15:00 Oral 2: Classification and Recommendation
16:00 Afternoon Break
16:30 Panel 1: Multimedia Security: Technology and Society
19:00 Welcome Reception at Rasane Restaurant

Tuesday, September 23

09:00 Keynote 2 by Sethuraman Panchanathan: Person-Centered Multimedia Computing: A New Paradigm Inspired by Assistive and Rehabilitative Applications
10:00 Oral 3: Event Detection and Recognition
11:00 Morning Break
11:30 Poster 2: Multimedia Analysis & Retrieval / Sketch & Demo Session
12:30 Lunch Break
14:00 Overview 2 by Ramesh Jain: Towards Smart Social Systems
15:00 Oral 4: Scalable Coding
16:00 Afternoon Break
16:30 Panel 2: Wearable multimedia computing
19:00 Banquet at Padang Golf Modern

Wednesday, September 24

09:00 Keynote 3 by Wen Gao: A New Domain-oriented Video Coding
10:00 Oral 5: HEVC
11:00 Morning Break
11:30 Poster 3: Visual Coding
12:30 Lunch Break
14:00 Overview 3 by Ioannis Pitas: 2D/3D AudioVisual Content Analysis & Description
15:00 Special Session: Quality Assessment
16:00 Afternoon Break
16:30 Special Session: Quality Assessment (Cont)
17:10 Closing Remarks

Sunday, September 21

19:00 – 21:00

Networking Reception at Novotel Hotel

 

Monday, September 22

08:50 – 09:00

Welcome Remarks

09:00 – 10:00

Keynote 1: Multimedia Security: Technology and Society

 

Ton Kalker, DTS, Inc., USA

Chair: Zhengyou Zhang, Microsoft Research, USA

 

10:00 – 11:00

Oral 1: 3D Video Analysis

Chair: Gene Cheung, National Institute of Informatics, Japan
Estimating Spatial Layout of Rooms from RGB-D Videos
Anran Wang, Nanyang Technological University, Singapore; Jiwen Lu, Advanced Digital Sciences Center, Singapore; Jianfei Cai, Nanyang Technological University, Singapore; Gang Wang, Nanyang Technological University, Singapore; Tat-Jen Cham, Nanyang Technological University, Singapore
Spatial layout estimation of indoor rooms plays an important role in many visual analysis applications such as robotics and human-computer interaction. While many methods have been proposed for recovering the spatial layout of rooms in recent years, their performance is still far from satisfactory due to high occlusion caused by the presence of objects that clutter the scene. In this paper, we propose a new approach to estimate the spatial layout of rooms from RGB-D videos. Unlike most existing methods which estimate the layout from still images, RGB-D videos provide more spatial-temporal and depth information, which is helpful to improve the estimation performance because more contextual information can be exploited. Given an RGB-D video, we first estimate the spatial layout of the scene in each single frame and compute the camera trajectory using the simultaneous localization and mapping (SLAM) algorithm. Then, the estimated spatial layouts of different frames are integrated to infer temporally consistent layouts of the room throughout the whole video. Our method is evaluated on the NYU RGB-D dataset, and the experimental results show the efficacy of the proposed approach.
Efficient Automatic Detection of 3D Video Artifacts
Mohan Liu, Fraunhofer HHI, Germany; Patrick Ndjiki-Nya, Fraunhofer HHI Image Processing, Germany; Ioannis Mademlis, AUTH, Greece; Nikos Nikolaidis, AUTH, Greece; Ioannis Pitas, University of Thessaloniki, Greece; Jean-Charles Le Quintrec, ARTE, France
This paper summarizes some common artifacts in stereo video content. These artifacts lead to a poor or even uncomfortable 3D viewing experience. Efficient approaches for detecting three typical artifacts, sharpness mismatch, synchronization mismatch and stereoscopic window violation, are presented in detail. Sharpness mismatch is estimated by measuring the width deviations of edge pairs in depth planes. Synchronization mismatch is detected based on the motion inconsistencies of feature points between the stereoscopic channels within a short time frame. Stereoscopic window violation is detected, using connected component analysis, when objects hit the vertical frame boundaries while being in front of the virtual screen. For the experiments, test sequences were created in professional studio environments and state-of-the-art metrics were used to evaluate the proposed approaches. The experimental results show that our algorithms are considerably robust in detecting 3D defects.
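
A rough sketch of the third detector described above: flag connected foreground components that touch a vertical frame boundary while sitting in front of the virtual screen (negative disparity, under a crossed-disparity convention). The function name and border width are illustrative assumptions, not the paper's implementation.

```python
import cv2
import numpy as np

def window_violations(fg_mask, disparity, border=8):
    """Return labels of components that violate the stereoscopic window."""
    # Connected component analysis on the binary foreground mask.
    n, labels = cv2.connectedComponents(fg_mask.astype(np.uint8))
    h, w = fg_mask.shape
    violating = []
    for lbl in range(1, n):                      # label 0 is background
        comp = labels == lbl
        touches = comp[:, :border].any() or comp[:, w - border:].any()
        in_front = disparity[comp].mean() < 0    # in front of the screen plane
        if touches and in_front:
            violating.append(lbl)
    return violating
```
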
Shot type characterization in 2D and 3D video content
Ioannis Tsingalis, Aristotle University, Greece; Anastasios Tefas, Aristotle University of Thessaloniki, Greece; Nikos Nikolaidis, AUTH, Greece; Ioannis Pitas, University of Thessaloniki, Greece
Due to the enormous increase of video and image content on the web over the last decades, automatic video annotation has become a necessity. Successful annotation of video and image content helps in successful indexing and retrieval in search engines. In this work we study a variety of possible shot type characterizations that can be assigned to a single video frame or still image. Possible ways to propagate these characterizations to a video segment (or to an entire shot) are also discussed. Finally, in the case of 3D (stereo) video, disparity information is used to detect certain shot types (e.g., over-the-shoulder ones). To the best of our knowledge, there is no other relevant method in the literature.

11:00 – 11:30

Morning Break

11:30 – 12:30

Poster Session 1: Multimedia Processing Systems and Applications

Chair: Mukti Budiarto, STMIK Raharja, Indonesia
A hybrid approach to animating the murals with Dunhuang style
Bingwen Jin, Zhejiang University, China; Linglong Feng, Zhejiang University, China; Gang LIU, Dunhuang Academy, China; Huaqing Luo, Dunhuang Academy, China; Weidong Geng, Zhejiang University, China
In order to animate the valuable murals of the Dunhuang Mogao Grottoes, we propose a hybrid approach to creating animation in the artistic style of the murals. Its key point is the fusion of 2D and 3D animation assets, for which a hybrid model is constructed from a 2.5D model, a 3D model, and registration information. The 2.5D model, created from 2D multi-view drawings, is composed of 2.5D strokes. For each 2.5D stroke, we let the user draw corresponding strokes on the surface of the 3D model in multiple views. The method then automatically generates registration information, which enables 3D animation assets to animate the 2.5D model. Finally, the animated line drawings are produced from the 2.5D and 3D models respectively and blended under the control of per-stroke weights. The user can manually modify the weights to get the desired animation style.
Vision-based Tracking in Large Image Database for Real-Time Mobile Augmented Reality
Madjid Maidi, Telecom SudParis, France; Marius Preda, Telecom SudParis, France; Yassine Lehiani, Telecom SudParis, France; Traian Lavric, Telecom SudParis, France
This paper presents an approach for tracking natural objects in augmented reality applications. The targets are detected and identified using a markerless approach relying upon the extraction of image salient features and descriptors. The method deals with large image databases using a novel strategy for feature retrieval and pairwise matching. Furthermore, the developed method integrates a real-time solution for 3D pose estimation using an analytical technique based on camera perspective transformations. The algorithm associates 2D feature samples coming from the identification part with 3D mapped points of the object space. Next, a sampling scheme for ordering correspondences is carried out to establish the inner 2D/3D projective relationship. The tracker performs localization using the feature images and 3D models to enhance the scene view with overlaid graphics by computing the camera motion parameters. The modules built within this architecture are deployed on a mobile platform to provide an intuitive interface for interacting with the surrounding real world. The system is evaluated on a challenging, scalable image dataset, and the obtained results demonstrate the effectiveness of the approach for versatile augmented reality applications.
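
A minimal sketch of the 2D/3D pose-recovery step: once identification yields matches between image features and 3D mapped points, the camera motion parameters follow from the projective relationship. The paper uses its own analytical technique; OpenCV's RANSAC-based PnP solver stands in for it here.

```python
import cv2
import numpy as np

def estimate_pose(pts3d, pts2d, K):
    """pts3d: Nx3 object points; pts2d: Nx2 image points; K: 3x3 intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32), K, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec              # pose used to render the overlaid graphics
```
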
A fusion-based enhancing approach for single sandstorm image
Xueyang Fu, Xiamen University, China; Yue Huang, Xiamen University, China; Delu Zeng, China ; Xiao-Ping Zhang, Ryerson University, Canada; Xinghao Ding, Xiamen University, China
In this paper, a novel enhancement approach for single sandstorm images is proposed. Degraded images suffer from problems such as color distortion, low visibility, fuzziness and non-uniform luminance, because light is absorbed and scattered by particles in a sandstorm. The proposed approach, based on fusion principles, aims to overcome these limitations. First, the degraded image is color corrected by adopting a statistical strategy. Then two inputs, which represent different brightness levels, are derived from the color-corrected image alone by applying Gamma correction. Three weight maps (sharpness, chromaticity and prominence), which contain important features for increasing the quality of the degraded image, are computed from the derived inputs. Finally, the enhanced image is obtained by fusing the inputs with the weight maps. The proposed method is the first to adopt a fusion-based method for enhancing single sandstorm images. Experimental results show that the enhanced images benefit from color correction, well-enhanced details and local contrast, improved global brightness, increased visibility and naturalness preservation. Moreover, the proposed algorithm mostly consists of per-pixel operations, which makes it appropriate for real-time applications.
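
A minimal sketch of the final fusion step: the gamma-derived inputs are blended per pixel with normalized weight maps. Computing the sharpness, chromaticity and prominence maps is the paper's contribution and is assumed given here.

```python
import numpy as np

def fuse(inputs, weight_maps, eps=1e-6):
    """inputs: list of HxWxC images; weight_maps: list of HxW maps."""
    W = np.stack(weight_maps).astype(np.float64) + eps
    W /= W.sum(axis=0, keepdims=True)      # per-pixel normalization
    I = np.stack(inputs).astype(np.float64)
    return (W[..., None] * I).sum(axis=0)  # weighted per-pixel blend
```
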
Robust Mixed Noise Removal with Non-parametric Bayesian Sparse Outlier Model
Peixian Zhuang, Xiamen University, China; Wei Wang, School of Information Science and Engineering, Xiamen University, China; Delu Zeng, China; Xinghao Ding, Xiamen University, China
This paper proposes a novel non-parametric Bayesian framework for solving the mixed noise removal problem. In order to remove the unstable effects of outlier noise, such as salt-and-pepper noise, in the training data, we decompose the observed data model into three component terms: ideal data, Gaussian noise and sparse outliers. The proposed model employs a spike-slab sparse prior to find sparser coefficients for the desired data term and the outlier noise. Note that the proposed non-parametric Bayesian model can infer the noise statistics from the training data and is robust to mixed noise without tuning of model parameters. Experimental results demonstrate that the proposed algorithm performs well with mixed noise and achieves better performance than other state-of-the-art methods.
Stable Pose Tracking from a Planar Target with an Analytical Motion Model in Real-time Applications
Po-Chen Wu, National Taiwan University, Taiwan; Yao-Hung Tsai, National Taiwan University, Taiwan; Shao-Yi Chien, National Taiwan University, Taiwan
Object pose tracking from a camera is a well-developed method in computer vision. In theory, the pose can be determined uniquely from a calibrated camera. However, in practice, most real-time pose estimation algorithms experience pose ambiguity. We consider that pose ambiguity, i.e., the detection of two distinct local minima according to an error function, is caused by a geometric illusion. In this case, both ambiguous poses are plausible, but we cannot select the pose with the minimum error as the final pose. Thus, we developed a real-time algorithm for correct pose estimation for a planar target object using an analytical motion model. Our experimental results showed that the proposed algorithm effectively reduced the effects of pose jumping and pose jittering. To the best of our knowledge, this is the first approach to address the pose ambiguity problem using an analytical motion model in real-time applications.
Design of Extrafine Complex Directional Wavelet Transform and Application to Image Denoising
Shrishail Gajbhar, DA-IICT, Gandhinagar, India; Manjunath Joshi, DA-IICT, Gandhinagar, India
In this paper, we propose decimated and undecimated designs of an extrafine complex directional wavelet transform (EFiCDWT) having 12 highpass directional subbands at each scale. EFiCDWT is obtained using a new mapping-based complex wavelet transform (CWT) followed by a complex-valued filter bank (FB) stage. The FB stage, with 2-D prototype complex FIR filters designed using complex transformations, finely decomposes the six complex directional subbands of the CWT to provide extra directionality. Our decimated EFiCDWT design is nearly shift-invariant with a redundancy factor of 2 (due to complex coefficients), while the undecimated design is completely shift-invariant and hence useful for image denoising. The main advantage of the proposed designs is their directional extensibility with possible generalized separable implementations. The proposed designs are tested on the image denoising application using a simple hard-thresholding scheme and show better denoising performance.
Video Super-Resolution via Dynamic Texture Synthesis
Chih-Chung Hsu, National Tsing Hua University, Taiwan; Li-Wei Kang, National Yunlin University of Science and Technology, Taiwan; Chia-Wen Lin, National Tsing Hua University, Taiwan
This paper addresses the problem of hallucinating the missing high-resolution (HR) details of a low-resolution (LR) video while maintaining the temporal coherence of the hallucinated HR details by using dynamic texture synthesis (DTS). Most existing multi-frame-based video super-resolution (SR) methods suffer from limited reconstructed visual quality due to inaccurate sub-pixel motion estimation between frames in an LR video. To achieve high-quality reconstruction of HR details for an LR video, we propose a texture-synthesis-based video super-resolution method, in which a novel DTS scheme is proposed to render the reconstructed HR details in a time-coherent way, so as to effectively address the temporal incoherence problem caused by traditional texture-synthesis-based image SR methods. To further reduce the complexity of the proposed method, our method only performs DTS-based SR on a selected set of key-frames, while the HR details of the remaining non-key-frames are simply predicted using bi-directional overlapped block motion compensation. Experimental results demonstrate that the proposed method achieves significant subjective and objective quality improvement over state-of-the-art video SR methods.
Cost Effective Video Streaming Using Server Push over HTTP 2.0
Sheng Wei, Adobe Systems Inc., USA; Viswanathan Swaminathan, Adobe Systems Inc., USA
The Hypertext Transfer Protocol (HTTP) has been widely adopted and deployed as the key technology for video streaming over the Internet. One of the consequences of leveraging traditional HTTP for video streaming is the significantly increased request overhead due to the segmentation of the video content into HTTP resources. The overhead becomes even more significant when non-multiplexed video and audio segments are deployed for the sake of lowering storage cost and enabling alternate audio tracks. In this paper, we investigate and address the request overhead problem by employing the server push technologies in the new HTTP 2.0 protocol. In particular, we develop a set of push strategies that actively deliver video and audio content from the HTTP server without requiring a request for each individual segment. We evaluate our approach in a Dynamic Adaptive Streaming over HTTP (DASH) streaming system. We show that the request overhead can be significantly reduced by using our push strategies. Also, we validate that the server push-based approach is compatible with existing HTTP streaming features, such as adaptive bitrate switching.
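
A minimal sketch of one push strategy consistent with the description above: when the client requests segment n, the server responds with it and additionally pushes the next k segments, removing k round trips. The indexing scheme is a hypothetical placeholder; the paper evaluates several such strategies in a DASH system.

```python
def segments_to_push(requested, k, total):
    """Indices of segments to push alongside the response to `requested`."""
    return list(range(requested + 1, min(requested + k, total - 1) + 1))

# With k = 3, a request for segment 10 of 100 also pushes segments 11-13.
assert segments_to_push(10, 3, 100) == [11, 12, 13]
```
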
Within- and Cross-Database Evaluations for Gender Classification via BeFIT Protocols
Nesli Erdogmus, Idiap Research Institute, Switzerland; Matthias Vanoni, Idiap Research Institute, Switzerland; Sebastien Marcel, Idiap Research Institute, Switzerland
With its wide range of applicability, gender classification is an important task in face image analysis and has drawn great interest from the pattern recognition community. In this paper, we aim to deal with this problem using Local Binary Pattern Histogram Sequences as feature vectors. Differently from what has been done in similar studies, the algorithm parameters used in the cropping and feature extraction steps are selected after an extensive grid search using the BANCA and MOBIO databases. The final system, which is evaluated on FERET, MORPH-II and LFW with gender-balanced and imbalanced training sets, is shown to achieve commensurate or better results compared to other state-of-the-art performances on those databases. The system is additionally tested with cross-database training in order to assess its accuracy in real-world conditions. For both within- and cross-database experiments on LFW and MORPH-II, the BeFIT protocols are utilized.
Extrinsic Calibration for Wide-Baseline RGB-D Camera Network
Ju Shen, University of Kentucky, USA; Wanxin Xu, University of Kentucky, USA; Ying Luo, University of Kentucky, USA; Po-chang Su, University of Kentucky, USA; Samson Cheung, University of Kentucky, USA
In recent years, color and depth camera systems have attracted intensive attention because of their wide applications in image-based rendering, 3D model reconstruction, and human tracking and pose estimation. These applications often require multiple color and depth cameras to be placed with wide separation so as to capture the scene objects from different perspectives. The difference in modality and the wide baseline make calibration a challenging problem. In this paper, we present an algorithm that simultaneously and automatically calibrates the extrinsic parameters of multiple color and depth cameras across the network. Rather than using the standard checkerboard, we use a sphere as the calibration object to identify correspondences across different views. We experimentally demonstrate that our calibration framework can seamlessly integrate different views with wide baselines and outperforms other techniques in the literature.

12:30 – 14:00

Lunch & MMSP TC

14:00 – 15:00

Overview 1: Autonomous Virtual Humans and Social Robots in Telepresence

Nadia Thalmann, Institute for Media Innovation, Nanyang Technological University, Singapore

Chair: Lekha Chaisorn, National University of Singapore, Singapore

15:00 – 16:00

Oral 2: Classification and Recommendation

Chair: Nam Ling, Santa Clara University, USA
View-invariant Feature Discovering for Multi-camera Human Action Recognition
Hong Lin, Tianjin University, China; Lekha Chaisorn, NUS, Singapore; Yongkang Wong, NUS, Singapore; An-An Liu, Tianjin University, China; Yu-Ting Su, Tianjin University, China; Mohan Kankanhalli, National University of Singapore, Singapore
Video surveillance systems are built to automatically detect events of interest, especially through object tracking and behavior understanding. In this paper, we focus on the task of human action recognition in a surveillance environment, specifically in a multi-camera monitoring scene. Although many approaches achieve success in recognizing human action from video sequences, they are designed for a single view and are not robust to viewpoint change. Human action recognition in a multi-view environment remains challenging due to the large variations from one view to another. We propose a novel method that solves the problem of transferring action models learned in one view (source view) to another view (target view). First, local space-time interest point features and global shape-flow features are extracted as low-level features. Then, a hybrid Bag-of-Words model is built for each action video. The data distributions of relevant actions from the source and target views are linked via a cross-view discriminative dictionary learning method. Through the view-adaptive dictionary pair learned by this method, data from the source and target views can be mapped into a common, view-invariant space. Furthermore, we extend our algorithm to transfer action models from multiple views to one view when multiple source views are available. We have tested our method on the IXMAS human action dataset with viewpoint change, and the results demonstrate that our method is very effective.
SVM Is Not Always Confident: Telling Whether the Output from Multiclass SVM Is True or False by Analysing Its Confidence Values
Toshihiko Yamasaki, University of Tokyo, Japan; Takaki Maeda, Japan; Kiyoharu Aizawa, Japan
This paper presents an algorithm to distinguish whether the output label yielded by a multiclass support vector machine (SVM) is true or false without knowing the answer. This judgment is made purely by confidence analysis, based on pre-training/testing using the training data. Our true/false judgment is useful for refining the outputs. We experimentally demonstrate that the decision value difference between the top candidate and the second candidate is a good measure. In addition, a proper threshold can be determined by pre-training/testing using only the training data. Experimental results on three standard image datasets demonstrate that our proposed algorithm improves the Matthews correlation coefficient (MCC) much more than thresholding the decision value of the top candidate alone.
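
A minimal sketch of the margin rule described above, assuming scikit-learn's one-vs-rest decision values; the threshold would be fixed beforehand by the paper's pre-training/testing on the training data.

```python
import numpy as np
from sklearn.svm import SVC

def predict_with_confidence(clf: SVC, X, threshold):
    scores = clf.decision_function(X)         # (n_samples, n_classes)
    top2 = np.sort(scores, axis=1)[:, -2:]    # second-best and best values
    margin = top2[:, 1] - top2[:, 0]          # decision value difference
    labels = clf.classes_[np.argmax(scores, axis=1)]
    return labels, margin >= threshold        # False flags a likely false label
```
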
Music Recommendation Based on Artist Novelty and Similarity
Ning Lin, National Taiwan University, Taiwan; Ping-Chia Tsai, National Taiwan University, Taiwan; Yu-An Chen, National Taiwan University, Taiwan; Homer H. Chen, National Taiwan University, Taiwan
Most existing systems recommend songs to the user based on the popularity of songs and singers. However, the system proposed in this paper is driven by an emerging and somewhat different need in the music industry—promoting new talents. The system recommends songs based on the novelty of singers (or artists) and their similarity to the user’s favorite artists. Novel artists whose popularity is on the rise have a higher priority to be recommended. Specifically, given a user’s favorite artists, the system first determines the candidate artists based on their similarity with the favorite artists and then selects those who have a higher novelty score than the favorite artists. Then, the system outputs a playlist composed of the most popular songs of the selected artists. The proposed system can be integrated into most existing systems. Its performance is evaluated using the Spotify Radio Recommender as a reference and a pool of 100 subjects recruited on campus. Experimental results show that our system achieves a high novelty score and a competitive user-preference score.
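
A minimal sketch of the selection logic described above; the similarity function, novelty scores, similarity threshold and playlist length are all illustrative assumptions.

```python
def recommend(favorites, novelty, similarity, top_songs, sim_min=0.5, n=10):
    """favorites: artist ids; novelty: dict artist -> score;
    similarity(a, b) -> [0, 1]; top_songs: dict artist -> ranked song list."""
    playlist = []
    for artist in novelty:
        if artist in favorites:
            continue
        similar = max(similarity(artist, f) for f in favorites) >= sim_min
        more_novel = all(novelty[artist] > novelty[f] for f in favorites)
        if similar and more_novel:
            playlist.extend(top_songs[artist][:2])  # their most popular songs
    return playlist[:n]
```
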
 

16:00 – 16:30

Afternoon Break

16:30 – 17:30

Panel 1: Multimedia Security: Technology and Society

Chair: Ton Kalker, DTS, Inc., USA
Panelists:
Touradj Ebrahimi, EPFL, Switzerland

Oscar Au, HKUST, Hong Kong

19:00 – 21:00

Welcome Reception at Rasane Restaurant

Tuesday, September 23

09:00 – 10:00

Keynote 2: Person-Centered Multimedia Computing: A New Paradigm Inspired by Assistive and Rehabilitative Applications

Sethuraman Panchanathan, Arizona State University, USA

Chair: Susanto Rahardja, National University of Singapore, Singapore

10:00 – 11:00

Oral 3: Event Detection and Recognition

Chair: Touradj Ebrahimi, EPFL, Switzerland
Graph-based Depth Video Denoising and Event Detection for Sleep Monitoring
Cheng Yang, University of Strathclyde, UK; Yu Mao, National Institute of Informatics, Japan; Gene Cheung, National Institute of Informatics, Japan; Vladimir Stankovic, University of Strathclyde, UK; Kevin Chan, USA
Quality of sleep greatly affects a person’s physiological well-being. Traditional sleep monitoring systems are expensive and intrusive enough that they disturb the natural sleep of clinical patients. In our previous work, we proposed a non-intrusive sleep monitoring system that first records depth video in real time and then analyzes the recorded depth data offline to track a patient’s chest and abdomen movements over time. Detection of abnormal breathing is then interpreted as episodes of apnoea or hypopnoea. Leveraging recent advances in graph signal processing (GSP), in this paper we propose two new additions to further improve our sleep monitoring system. First, temporal denoising is performed using a block motion vector smoothness prior expressed in the graph-signal domain, so that unwanted temporal flickering is removed. Second, a graph-based event classification scheme is proposed, so that detection of apnoea / hypopnoea can be performed accurately and robustly. Experimental results show, first, that the graph-based temporal denoising scheme outperforms a bilateral filter in terms of flicker removal and, second, that our graph-based event classification scheme is noticeably more robust to errors in the training data than two conventional implementations of the support vector machine (SVM).
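
A minimal sketch of a graph-signal smoothness denoiser of the kind referred to above: treating block motion vectors as a signal on a graph with Laplacian L, a quadratic smoothness prior leads to a closed-form linear solve. Graph construction and the paper's exact prior are assumed given.

```python
import numpy as np

def graph_smooth(y, L, mu=1.0):
    """y: (n,) noisy graph signal; L: (n, n) graph Laplacian; mu: prior weight."""
    # argmin_x ||x - y||^2 + mu * x^T L x  =>  (I + mu L) x = y
    return np.linalg.solve(np.eye(L.shape[0]) + mu * L, y)
```
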
Gaze Direction Estimation From Static Images
Krystian Radlak, Silesian University of Technology, Poland; Michal Kawulok, Poland; Bogdan Smolka, Poland; Natalia Radlak, Poland
This study presents a novel multilevel algorithm for gaze direction recognition from static images. The proposed solution consists of three stages: (i) eye pupil localization using a multi-stage ellipse detector combined with a support vector machine verifier, (ii) eye bounding box localization calculated using a hybrid projection function, and (iii) gaze direction classification using support vector machines and random forests. The proposed method has been tested on the Eye-Chimera database with very promising results. Extensive tests show that eye bounding box localization allows us to achieve highly accurate results both in terms of eye localization and gaze direction classification.
Soccer Video Summarization based on Cinematography and Motion Analysis
Ngoc Nguyen, JAIST, Japan; Atsuo Yoshitaka, JAIST, Japan
Summarization of soccer videos has been widely studied due to their worldwide viewership and potential commercial applications. Most existing methods focus on searching for highlight events in soccer videos, such as goals and penalty kicks, and generate a summary as a list of such events. However, besides highlight events, scenes of intensive competition between the players of the two teams and emotional moments are also interesting. In this paper, we propose a soccer summarization system which is able to capture highlight events, scenes of intensive competition, and emotional moments. Based on the flow of soccer games, we organize a video summary as follows: first, scenes of intensive competition; second, what events happened; third, who was involved in the events; and finally, how players or the audience reacted to the events. With this structure, the generated summary is more complete and interesting because it provides both game play and emotional moments. Our system takes broadcast video as input and divides it into multiple clips based on cinematographic features such as sport video production techniques, shot transitions, and camera motions. Then, the system evaluates the interest level of each clip to generate a summary. Experimental results and a subjective evaluation are carried out to assess the quality of the generated summary and the effectiveness of our proposed interest level measure.
 
 

11:00 – 11:30

Morning Break

11:30 – 12:30

Poster 2: Multimedia Analysis & Retrieval

Chair: Yuliana Isma Graha, STMIK Raharja, Indonesia
Skeleton-Guided Vectorization of Chinese Calligraphy Images
Wanqiong Pan, Peking University, China; Zhouhui Lian, Institute of Computer Science and Technology, Peking University, China; Yingmin Tang, Institute of Computer Science and Technology, Peking University, China; Jianguo Xiao, Institute of Computer Science and Technology, Peking University, China
How to automatically generate compact and high-quality vectorizations of Chinese calligraphy images is a challenging problem, since these images usually suffer from noisy contours and discontinuous strokes. In this paper, we propose a skeleton-guided approach to vectorize Chinese calligraphy images. Since the skeleton reflects the writing trace and is less influenced by contour noise, our method can extract the important writing style from noisy contours. Specifically, in our method, the calligraphy image is first preprocessed by binarization and denoising. Then salient contour points are detected by a novel algorithm. Afterwards, under the guidance of skeleton information, the salient points are classified into corner points and joint points. Finally, a dynamic curve fitting procedure is applied to generate the vectorization result. Experimental results demonstrate that our skeleton-guided approach can automatically distinguish tiny features from contour noise and thus obtains more visually satisfactory performance compared to other existing methods.
GrabCut-Based Abandoned Object Detection
Kahlil Muchtar, National Sun Yat-sen University, Taiwan; Chih-Yang Lin, Asia University, Taiwan; Chia-Hung Yeh, National Sun Yat-sen University, Taiwan
This paper presents a detection-based method to detect abandoned objects in a surveillance scene. Unlike tracking-based approaches, which are commonly complicated and unreliable in crowded scenes, the proposed method employs background (BG) modelling and focuses only on immobile objects. The main contribution of our work is to build an abandoned object detection system which is robust and can resist interference (shadow, illumination changes and occlusion). In addition, we introduce an MRF model and shadow removal to our system. MRF is a promising way to model neighbourhood information when labeling a pixel as either background or abandoned object; it represents the correlation and dependency between a pixel and its neighbours. By incorporating the MRF model, as shown in the experimental part, our method can efficiently reduce false alarms. To evaluate the system’s robustness, several datasets, including the CAVIAR dataset and outdoor test cases, are used in our experiments.
A Pilot Study on Affective Classification of Facial Images for Emerging News Topics
Ligang Zhang, Xi’an University of Technology, China; Andy Cher Han Lau, Queensland University of Technology, Australia; Dian Tjondronegoro, Queensland University of Technology, Australia; Vinod Chandran, Queensland University of Technology, Australia
The proliferation of news reports published on websites and the sharing of news among social media users necessitate effective techniques for analysing the image, text and video data related to news topics. This paper presents the first study to classify affective facial images on emerging news topics. The proposed system dynamically monitors and selects the current hot (of great interest) news topics with strong affective interestingness using textual keywords in news articles and social media discussions. Images from the selected hot topics are extracted and classified into three emotion categories, positive, neutral and negative, based on the facial expressions of subjects in the images. Performance evaluations on two facial image datasets collected from real-world resources demonstrate the applicability and effectiveness of the proposed system in the affective classification of facial images in news reports. Facial expression shows high consistency with the affective textual content in news reports for positive emotion, while only low correlation has been observed for neutral and negative. The system can be directly used in applications such as assisting editors in choosing photos with proper affective semantics for a certain topic during news report preparation.
NMF-based Multiple Pitch Estimation Using Sparseness and Inter-frame Continuity Constraints
Takanori Fujisawa, Keio University, Japan; Ikuo Degawa, Keio University, Japan; Masaaki Ikehara, Keio University, Japan
This paper proposes an NMF-based (non-negative matrix factorization) multiple pitch estimation algorithm. The approach of NMF-based multiple pitch estimation is to decompose the input magnitude spectrogram into a sum of basis spectra representing individual pitches. In decomposing music signals, the amplitudes of the basis spectra should be sparse, and their shape should be continuous between neighboring temporal frames. We introduce a matrix-norm constraint to enforce these characteristics at once and propose a new NMF algorithm for spectral decomposition under this constraint. Evaluation on solo piano music shows that this algorithm achieves more robust pitch estimation where the input spectrum differs in shape from the basis spectra or changes over time.
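
For reference, a plain multiplicative-update NMF on a magnitude spectrogram, the baseline decomposition the paper builds on; the proposed sparseness and inter-frame continuity constraint would add penalty terms to these updates and is not modelled here.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    """V: (freq_bins, frames) magnitude spectrogram; returns bases W, activations H."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):              # Lee-Seung updates, Euclidean cost
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```
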
Social Image Search exploiting Joint Visual-Textual information within a Fuzzy Hypergraph Framework
Konstantinos Pliakos, Aristotle University of Thessaloniki, Greece; Constantine Kotropoulos, Aristotle University of Thessaloniki, Greece
The unremitting growth of social media popularity is manifested by the vast volume of images uploaded to the web. Despite the extensive research efforts, there are still open problems in accurate or efficient image search methods. The majority of existing methods, dedicated to image search, treat image visual content and semantic information captured by social image tags, separately or in a sequential manner. Here, a novel and efficient method is proposed, exploiting visual and textual information simultaneously. The joint visual-textual information is captured by a fuzzy hypergraph powered by the term-frequency and inverse-document-frequency (tf-idf) weighting scheme. Experimental results conducted on two datasets substantiate the merits of the proposed method. Indicatively, an average precision of 77% is measured at 1% recall for image-based queries.
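
A minimal sketch of the tf-idf weighting that powers the hypergraph: each (image, tag) weight combines the tag's frequency in that image's annotation with its rarity across the collection. The data layout is an illustrative assumption.

```python
import math
from collections import Counter

def tfidf_weights(image_tags):
    """image_tags: dict image_id -> list of tags; returns (image, tag) -> weight."""
    n = len(image_tags)
    df = Counter(t for tags in image_tags.values() for t in set(tags))
    weights = {}
    for img, tags in image_tags.items():
        tf = Counter(tags)
        for tag, c in tf.items():
            weights[(img, tag)] = (c / len(tags)) * math.log(n / df[tag])
    return weights
```
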
Optimal Detector for Camera Model Identification Based on an Accurate Model of DCT Coefficients
Thanh Hai Thai, Troyes University of Technology, France; Remi Cogranne, Troyes University of Technology, France; Florent Retraint, Troyes University of Technology, France
The goal of this paper is to design a statistical test for the camera model identification problem. The approach is based on a state-of-the-art model of Discrete Cosine Transform (DCT) coefficients to capture their statistical differences, which jointly result from different sensor noises and in-camera processing algorithms. The noise model parameters are considered as a camera fingerprint to identify camera models. The camera model identification problem is cast in the framework of hypothesis testing theory. In an ideal context where all model parameters are perfectly known, this paper studies the optimal detector given by the Likelihood Ratio Test (LRT) and analytically establishes its statistical performance. In practice, a Generalized LRT is designed to deal with the difficulty of unknown parameters such that it can meet a prescribed false alarm probability while ensuring high detection performance. Numerical results on a simulated database and natural JPEG images highlight the relevance of the proposed approach.
Multi-View Action Recognition by Cross-domain Learning
Weizhi Nie, Tianjin University, China; An-An Liu, Tianjin University, China; Jing Yu, Tianjin University, China; Yu-Ting Su, Tianjin University, China; Lekha Chaisorn, NUS, Singapore; Yongkang Wong, NUS, Singapore
This paper proposes a novel multi-view human action recognition method by discovering and sharing common knowledge among different video sets captured from multiple viewpoints. To our knowledge, we are the first to treat a specific view as the target domain and the others as source domains, and consequently to formulate multi-view action recognition in a cross-domain learning framework. First, the classic bag-of-visual-words framework is implemented for visual feature extraction in individual viewpoints. Then, we propose a cross-domain learning method with a block-wise weighted kernel function matrix to highlight the salient components and consequently augment the discriminative ability of the model. Extensive experiments are conducted on IXMAS, the popular multi-view action dataset. The experimental results demonstrate that the proposed method consistently outperforms the state of the art.
Local Stereo Matching Algorithm using Rotation-Skeleton-Based Region
Xing Li, Beijing Jiaotong University, China; Yao Zhao, China; Chun Yu Lin, China; Chao Yao, China
This paper proposes a local stereo matching algorithm for accurate disparity estimation using a 45° rotation-skeleton-based region (RSBR). For local stereo matching, an adaptive local region is important for the performance of disparity estimation. In order to generate more accurate regions, we use a skeleton with 45° rotation to divide an initial window, obtained by mean-shift segmentation, into four parts. All the pixels at the same height level are judged simultaneously, and the valid pixels in the four parts construct the 45° RSBR. Compared with the common local support region based on an orthogonal skeleton, the 45° RSBR maintains consistency in images and has the advantage of error tolerance. The local stereo matching algorithm with RSBR is improved in two aspects. First, the hybrid cost aggregation using the RSBR helps to remove noise caused by outliers and improves subjective performance. Second, the candidate values in the refinement step are selected from the RSBR, which ensures the validity of the candidates. Experimental results demonstrate good objective and subjective performance on the Middlebury stereo datasets.
Classifying Harmful Children’s Content Using Affective Analysis
Joseph Santarcangelo, Ryerson University, Canada; Xiao-Ping Zhang, Ryerson University, Canada
This paper categorizes children’s videos according to expertly assigned, predefined positive or negative cognitive impact categories. The method uses affective features to determine whether a video belongs to the positive or the negative cognitive impact category. The work demonstrates that simple affective features outperform more complex systems in making this determination. The method is tested on a set of videos that have been classified as having a short-term or long-term measurable negative or positive impact on cognition, based on cited psychological literature. It was found that affective analysis achieved superior performance using fewer features than state-of-the-art video genre classification systems, and that arousal features performed better than valence features.
Background Subtraction Under Sudden Illumination Change
Hasan Sajid, University of Kentucky, USA; Samson Cheung, University of Kentucky, USA
In this paper, we propose a Multiple Background Model based Background Subtraction (MB2S) algorithm that is robust against sudden illumination changes in indoor environments. It uses multiple background models of expected illumination changes, followed by both pixel- and frame-based background subtraction in both the RGB and YCbCr color spaces. The masks generated after processing these input images are then combined in a framework to classify background and foreground pixels. Evaluation of the proposed approach on publicly available test sequences shows higher precision and recall than other state-of-the-art algorithms.
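
A minimal sketch of the mask-fusion idea: per-model foreground masks (one per expected-illumination background model, in both RGB and YCbCr) are combined, here by a simple majority vote. The paper's actual framework also mixes pixel- and frame-based subtraction, which this stand-in does not model.

```python
import numpy as np

def combine_masks(masks):
    """masks: list of HxW boolean foreground masks, one per model/color space."""
    votes = np.stack(masks).astype(np.uint8).sum(axis=0)
    return votes > len(masks) // 2  # foreground where most models agree
```
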

11:30 – 12:30

Sketch & Demo Session

Chair: Yuliana Isma Graha, STMIK Raharja, Indonesia
Adaptive Intra Refresh Algorithm for HEVC Video Transmission
Htoo Maung, Chulalongkorn University, Thailand; Supavadee Aramvith, Chulalongkorn University, Thailand
With the help of the new video coding standard, HEVC, and other advances in related technologies, transmission of HD and ultra-HD video over wireless networks is expected to increase significantly in the near future. However, maintaining good perceived video quality is still quite challenging due to network congestion, delay, bandwidth limitation, etc. Although HEVC offers a good compression ratio, error robustness is also very important, especially in wireless environments. In this paper, we propose a feedback-based adaptive intra refresh algorithm for error resilient video coding. Both picture-level and slice-level intra refresh algorithms are designed and tested at different bit rates.
Parametrization of 3D point clouds using Superquadrics
Marco Niehaus, Technical University of Ilmenau, Germany; Gerald Schuller, Technical University of Ilmenau, Germany
Free-viewpoint Video Synthesis for Sport Scenes Captured with a Single RGB-D Camera
Hiroshi Sankoh, KDDI R&D Laboratories Inc., Japan; Sei Naito, KDDI R&D Laboratories Inc., Japan
 
 

12:30 – 14:00

Lunch Break

14:00 – 15:00

Overview 2: Towards Smart Social Systems

Ramesh Jain, University of California, Irvine, USA

Chair: Mohan Kankanhalli, National University of Singapore, Singapore

15:00 – 16:00

Oral 4: Scalable Coding

Chair: Byeungwoo Jeon, Sungkyunkwan University, South Korea
A New Error-Mapping Scheme for Scalable Audio Coding
Haibin Huang, Institute for Infocomm Research, Singapore; Susanto Rahardja, National University of Singapore, Singapore
In scalable audio coders, such as MPEG-4 SLS, error mapping is used to map quantization errors in the core coder to an error signal before bit-plane coding. In this paper, we propose a new error-mapping scheme derived by observing the statistical properties of the error signal. Compared with the error mapping in SLS, the proposed scheme improves coding efficiency and reduces the computational complexity of the coder. An average improvement of 9 points in MUSHRA score has been achieved by the proposed scheme in subjective listening tests. The proposed error mapping adds a useful new tool to the existing toolset for constructing next-generation scalable audio coders. The scheme has also been evaluated for AVS-2 Audio, found to be superior and suitable for the standard, and has therefore been adopted into it.
Bidirectional Hierarchical Anchoring of Motion Fields for Scalable Video Coding
Dominic Ruefenacht, UNSW, Australia; Reji Mathew, UNSW, Australia; David Taubman, UNSW, Australia
The ability to predict motion fields at finer temporal scales from coarser ones is a very desirable property for temporal scalability. This is at best very difficult in current state-of-the-art video codecs (i.e., H.264, HEVC), where motion fields are anchored in the frame that is to be predicted (target frame). In this paper, we propose to anchor motion fields in the reference frames. We show how from only one fully coded motion field at the coarsest temporal level as well as breakpoints which signal discontinuities in the motion field, we are able to reliably predict motion fields used at finer temporal levels. This significantly reduces the cost for coding the motion fields. Results on synthetic data show improved rate-distortion (R-D) performance and superior scalability, when compared to the traditional way of anchoring motion fields.
Embedded Coding of Optical Flow Fields for Scalable Video Compression
Sean Young, UNSW, Australia; Reji Mathew, UNSW, Australia; David Taubman, UNSW, Australia
An embedded coding scheme for dense motion (optical flow) fields is proposed. Such a scheme is particularly useful in scalable video compression, where one must compensate for inter-frame motion at various visual qualities and resolutions. However, the high cost of coding such fields has often made this option prohibitive. Using our previously developed ‘breakpoint’-adaptive wavelet transform, we show that it is possible to code dense motion fields efficiently while simultaneously endowing the coded motion representation with embedded resolution and quality scalability attributes. Performance comparisons with the traditional non-scalable block-based model are also made and presented with the aid of a modified H.264/AVC JM reference encoder.
 
 

16:00 – 16:30

Afternoon Break

16:30 – 17:30

Panel 2: Wearable multimedia computing

Chair: Homer Chen, National Taiwan University, Taiwan

Panelists:
Zhengyou Zhang, Microsoft, USA
Ellen Yi-Luen Do, Georgia Tech, USA & National University of Singapore, Singapore

Touradj Ebrahimi, EPFL, Switzerland

 

19:00 – 21:00

Banquet at Istana Nelayan

Wednesday, September 24

09:00 – 10:00

Keynote 3: A New Domain-oriented Video Coding

Wen Gao, Peking University, China

Chair: Fernando Pereira, IST-IT, Portugal

10:00 – 11:00

Oral 5: HEVC

Chair: Oscar C. Au, HKUST, Hong Kong
Towards Efficient Wavefront Parallel Encoding of HEVC: Parallelism Analysis and Improvement
Keji Chen, Peking University, China; Yizhou Duan, China; Jun Sun, China; Zongming Guo, China
High Efficiency Video Coding (HEVC) is the new-generation video coding standard, achieving significant improvement in coding efficiency. Although HEVC is promising for many applications, the increased computational complexity is a serious problem, which makes parallelization necessary in HEVC encoding. To better understand the bottleneck of parallelization and improve encoding speed, in this paper we propose a Coding Tree Block (CTB) level parallelism analysis method as well as a novel Inter-Frame Wavefront (IFW) parallel encoding method. First, by establishing the relationship between parallelism and dependence, parallelism is precisely described by CTB-level dependence as a criterion to evaluate different parallel methods for HEVC. On this basis, by effectively decreasing the dependence of Wavefront Parallel Processing (WPP), the IFW method is developed. Finally, with the proposed parallelism analysis method, IFW is theoretically proved to have higher parallelism than other representative HEVC parallel methods. Extensive experimental results show that the proposed method and implementation bring up to 17.81x, 14.34x and 24.40x speedups for HEVC encoding of WVGA, 720p and 1080p standard test sequences, with the same negligible coding performance degradation as WPP, making it a promising technology for future large-scale HEVC video applications.
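
For context, a minimal sketch of the CTB-level dependency rule behind WPP-style wavefront encoding: a CTB is ready once its left neighbour and the above-right neighbour are finished. IFW additionally relaxes dependencies across frames, which this single-frame check does not model.

```python
def ctb_ready(done, x, y):
    """done: set of (x, y) coordinates of finished CTBs in the current frame."""
    left_ok = x == 0 or (x - 1, y) in done
    above_right_ok = y == 0 or (x + 1, y - 1) in done
    return left_ok and above_right_ok
```
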
Highly Optimized Implementation of HEVC Decoder for General Processors
Shengbin Meng, Peking University, China; Yizhou Duan, China; Jun Sun, China; Zongming Guo, China
In this paper, we propose a novel design and optimized implementation of the HEVC decoder. First, a novel decoder prototype with refined decoding workflow and efficient memory management is designed. Then on this basis, a series of single-instruction-multiple-data (SIMD) based algorithms are used to speed up several time-consuming modules in HEVC decoding. Finally, a frame-based parallel framework is applied to exploit the multi-threading technology on multicore processors. With the highly optimized HEVC decoder, decoding speed of 246fps on Intel i7-2400 3.4GHz quad-core processor for 1080p videos and 52fps on ARM Cortex-A9 1.2GHz dual-core processor for 720p videos can be achieved in our experiments.
Adaptive Low Complexity Colour Transform for Video Coding
Rajitha Weerakkody, BBC R&D, UK; Marta Mrak, BBC R&D, UK
For video compression, the RGB signals are usually converted at the source to a perceptual colour space, followed by chroma sub-sampling, for coding efficiency. This is based on the typically higher sensitivity of the human visual system to the luminance than to the chrominance of image signals. However, there are specific applications that demand carrying the full RGB signals through the transmission chain, which may also benefit from lossless colour transforms for efficient coding. In either case, the best colour transform function is noted to be content dependent, although fixed transforms are typically adopted for convenience. This paper presents a method of dynamically adapting this colour transform function for each picture block, using a class of low-complexity lifting-based schemes. The performance of the proposed algorithm is compared with a number of fixed colour transform schemes and shows a significant compression gain over native RGB coding and the YCoCg transform.
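
As an illustration of the class of low-complexity lifting-based schemes the method selects among, here is the well-known reversible YCoCg-R transform; the adaptive per-block choice of transform is the paper's contribution and is not shown.

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward lossless YCoCg-R lifting steps (integer inputs)."""
    co = r - b
    t = b + (co >> 1)
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_r_to_rgb(y, co, cg):
    """Inverse lifting steps; exactly reverses rgb_to_ycocg_r."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b

assert ycocg_r_to_rgb(*rgb_to_ycocg_r(120, 200, 35)) == (120, 200, 35)
```
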
 
 

11:00 – 11:30

Morning Break

11:30 – 12:30

Poster 3: Visual Coding

Chair: Sugeng Santoso, STMIK Raharja, Indonesia
An Improved Rate Control Algorithm for SVC with Optimised MAD Prediction
Xin Lu, University of Warwick, England; Graham Martin, University of Warwick, England
An improved rate control algorithm for the Scalable Video Coding (SVC) extension of H.264/AVC is described. The rate control scheme applied to the Base Layer (BL) of SVC adopts the linear Mean Absolute Difference (MAD) prediction and quadratic Rate Distortion (RD) models inherited from H.264/AVC. A MAD prediction error always exists and cannot be avoided. However, some encoding results of the base layer can be used to inform the coding of the enhancement layers (ELs), thus benefitting from the bottom-up coding structure of SVC. This property forms the basis for the proposed rate control approach. Simulation results show that accurate rate control is achieved and, compared to the default rate control algorithm of SVC, namely the JVT-G012 rate control scheme, the average PSNR is increased by 0.27dB or the average bit rate is reduced by 4.81%.
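
For reference, the two inherited H.264/AVC models mentioned above, as commonly stated for JVT-G012-style rate control; the coefficients are refreshed by regression during encoding.

```python
def predict_mad(prev_mad, a1=1.0, a2=0.0):
    """Linear MAD prediction: MAD of the current unit from the co-located one."""
    return a1 * prev_mad + a2

def texture_bits(mad, qstep, c1, c2):
    """Quadratic R-D model: texture bits as a function of MAD and Qstep."""
    return c1 * mad / qstep + c2 * mad / qstep ** 2
```
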
Iterative Hierarchical True Motion Estimation for Temporal Frame Interpolation
Anton Veselov, SUAI, Russia; Marat Gilmutdinov, SUAI, Russia
Temporal frame interpolation is an important problem in many areas of modern video processing. However, the task of accurate motion estimation for temporal interpolation remains an open issue. In this paper we propose a new motion estimation algorithm for motion compensated frame interpolation. The developed algorithm is consistent with the conventional model of true motion and can be considered a method of local minimization for the optimization problem defined within this model. The evaluation of the algorithm is performed for the frame rate up-conversion problem. Simulation results demonstrate performance comparable to existing frame rate up-conversion methods.
A Novel Video Coding Scheme using a Scene Adaptive Non-Parametric Background Model
Subrata Chakraborty, University of Southern Queensland, Australia; Manoranjan Paul, Charles Sturt University, Australia; Manzur Murshed, Federation University, Australia; Mortuza Ali, Federation University, Australia
Video coding techniques utilising background frames provide better rate-distortion performance than the latest video coding standard by exploiting coding efficiency in uncovered background areas. Parametric approaches such as mixture of Gaussian (MoG) based background modeling have been widely used; however, they require prior knowledge about the test videos for parameter estimation. Recently introduced non-parametric (NP) background modeling techniques have successfully improved video coding performance through an HEVC-integrated coding scheme. The inherent nature of the NP technique exhibits superior performance in dynamic background scenarios compared to the MoG based technique, without a priori knowledge of the video data distribution. Although NP based coding schemes have shown promising coding performance, they suffer from a number of key challenges: (a) determining the optimal subset of training frames for generating a suitable background that can be used as a reference frame during coding; (b) incorporating dynamic changes in the background effectively after the initial background frame is generated; (c) managing frequent scene changes that lead to performance degradation; and (d) optimizing the coding quality ratio between an I-frame and other frames under bit rate constraints. In this study we develop a new scene adaptive coding scheme using the NP based technique, capable of solving these challenges by incorporating a continuously updating background generation process. Extensive experimental results are provided to validate the effectiveness of the new scheme.
Fast Mode Decision for Error Resilient Video Coding
Yunong Wei, Communication University of China, China; Yuan Zhang, China; Jinyao Yan, China
Error resilience and low-complexity video encoding are two major requirements of real-time visual communication on mobile devices. To address both requirements simultaneously, this paper presents a fast mode decision algorithm for error resilient video coding in packet loss environments. The proposed algorithm is a two-step method: early skip mode decision and early intra mode decision. Different from existing methods for early skip mode decision, the proposed method takes the error-propagation distortion into account when estimating the coding cost. Considering that intra blocks are frequently used to terminate error propagation, we also propose a method to quickly estimate the intra block coding cost, so that the intra mode can be determined early. Overall, the proposed method significantly reduces encoding time while keeping coding efficiency similar to the rate-distortion optimized mode decision method.
Compression of HD Videos by a Contrast-Based Human Attention Algorithm
Sylvia N’guessan, SCU, USA; Nam Ling, SCU, USA; Zhouye Gu, SCU, USA
The emergence of social networks combined with the prevalence of mobile technology has led to an increasing demand for high definition video transmission and storage. One of the challenges of video compression is the ability to reduce video size without significant visual quality loss. In this paper, we propose a new method that achieves compression reductions ranging from 2.6% to 16.9% while maintaining or improving subjective quality. Precisely, our approach is a saliency-aware mechanism that predicts and classifies regions-of-interest (ROIs) of a typical human eye gaze according to the static attention model (SAM) from the human visual system (HVS). We coin the term contrast human attention regions of interest (Contrast-HAROIs) to refer to those identified regions. Finally, we reduce the data load of the non Contrast-HAROIs via a smoothing spatial filter. Experimental results on eight sequences show that our technique reduces the size of HD videos further than the standard H.264/AVC. Moreover, it is on average 30% faster than another saliency- and motion-aware algorithm.
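
A minimal sketch of the final data-reduction step, assuming the Contrast-HAROI mask is already computed: pixels outside the attention regions are low-pass filtered so the encoder spends fewer bits there.

```python
import cv2
import numpy as np

def smooth_non_roi(frame, roi_mask, ksize=11):
    """frame: HxWx3 uint8; roi_mask: HxW bool, True inside Contrast-HAROIs."""
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    return np.where(roi_mask[..., None], frame, blurred)  # keep ROIs intact
```
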
Block-based Compressive Sensing of Video using Local Sparsifying Transform
Byeungwoo Jeon, Sungkyunkwan University, South Korea; Chien Van Trinh, Sungkyunkwan University, South Korea; Viet Anh Nguyen, Sungkyunkwan University, South Korea
Block-based compressive sensing is attractive for sensing natural images and video because it makes large images and videos tractable. However, its reconstruction performance still leaves much room for improvement. This paper proposes a new block-based compressive video sensing algorithm that can recover video sequences with high quality. It generates initial key frames by combining the augmented Lagrangian total variation with a nonlocal means filter, which is well known for preserving edges and reducing noise. Additionally, a local principal component analysis (PCA) transform is employed to enhance detailed information. The non-key frames are initially predicted from their measurements and the reconstructed key frames. Furthermore, PCA transform-aided side-information regularization iteratively improves the reconstructed quality. Simulation results demonstrate the effectiveness of our algorithm.
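For readers unfamiliar with the measurement side of block-based compressive sensing, the sketch below shows the basic per-block acquisition y = Phi @ x plus a naive recovery; the paper's actual reconstruction (augmented Lagrangian TV, nonlocal means, local PCA) is far more elaborate.

import numpy as np

def sense_blocks(image, block=16, subrate=0.25, seed=0):
    # Measure each BxB block x as y = Phi @ x with a shared random
    # Gaussian measurement matrix Phi of m = subrate * B*B rows.
    rng = np.random.default_rng(seed)
    n = block * block
    m = int(subrate * n)
    phi = rng.standard_normal((m, n)) / np.sqrt(m)
    measurements = []
    h, w = image.shape
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            x = image[i:i + block, j:j + block].astype(np.float64).reshape(-1)
            measurements.append(phi @ x)
    return phi, measurements

def minimum_norm_recover(phi, y, block=16):
    # Naive pseudo-inverse recovery, for illustration only.
    return (np.linalg.pinv(phi) @ y).reshape(block, block)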
A Fast Intermode Decision Algorithm Based on Analysis of Inter Prediction Residual
Luheng Jia, HKUST, Hong Kong; Oscar C. Au, HKUST, Hong Kong; Chi-ying Tsui, HKUST, Hong Kong
Rate-distortion-optimized (RDO) intermode decision is one of the most effective tools for improving the coding performance of modern coding standards such as H.264/AVC and HEVC. However, RDO intermode decision also incurs very intensive computation. To reduce this complexity, a fast intermode decision algorithm is presented in this paper. A mathematical analysis of the inter prediction residual is performed, which explicitly shows the impact of edge information and motion characteristics on prediction accuracy. Moreover, it is shown that, for a fixed quantization step, minimizing the R-D cost over different partition types is equivalent to minimizing the variance of the transformed residual, which can be expressed in terms of motion vector and edge gradient components. Consequently, the complex calculation of rate and distortion is replaced by a simple pre-analysis of the video content, and repeated motion estimation (ME) and entropy coding are avoided. Building on this theoretical analysis, a fast intermode decision algorithm is proposed. Experimental results show that the fast method achieves considerable complexity reduction with negligible coding performance degradation.
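A toy Python sketch of the resulting decision rule, under the paper's fixed-quantization-step equivalence; the candidate set and cost proxy here are illustrative assumptions.

import numpy as np

def cost_proxy(residual_blocks):
    # Proxy for the R-D cost of one partition type: the variance of the
    # prediction residual, summed over the partition's blocks.
    return sum(float(np.var(b)) for b in residual_blocks)

def fast_intermode_decision(candidates):
    # Pick the partition whose residual-variance proxy is smallest,
    # skipping full RDO (repeated ME and entropy coding) for the rest.
    return min(candidates, key=lambda p: cost_proxy(candidates[p]))

# Hypothetical usage with 16x16 vs. four 8x8 partitions:
candidates = {
    "16x16": [np.random.randn(16, 16)],
    "8x8": [np.random.randn(8, 8) for _ in range(4)],
}
best_mode = fast_intermode_decision(candidates)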
Automatic High Dynamic Range Hallucination in Inverse Tone Mapping
Pin-Hung Kuo, National Taiwan University, Taiwan; Huai Jen Liang, National Taiwan University, Taiwan; Chi-Sun Tang, National Taiwan University, Taiwan; Shao-Yi Chien, National Taiwan University, Taiwan
The dynamic range of displays keeps increasing, which means that content can be recorded and displayed with more detail. However, legacy content was recorded with a lower dynamic range, and such content looks unsatisfying next to high dynamic range content, especially in saturated or over-exposed regions. This paper proposes an algorithm to compensate for such over-exposed regions, called automatic high dynamic range image hallucination for inverse tone mapping. Inverse tone mapping is the process of creating a high dynamic range image from a single low dynamic range image. In this work, high dynamic range image hallucination is the key method used to reproduce the information lost during low dynamic range capture. Previous methods require user interaction to define the hallucination criteria, which is impractical in applications where user interaction is not available. In this paper, the hallucination is performed automatically with the assistance of a luminance and texture decoupling process. The scheme produces visually satisfying results and, being automatic, has the potential to be applied to video inverse tone mapping.
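To make the terms concrete, here is a toy inverse tone-mapping sketch in Python; the gamma, peak luminance and saturation threshold are assumed values, and the hallucination step itself (the paper's contribution) is only indicated by a comment.

import numpy as np

def expand_ldr(ldr, gamma=2.2, peak=1000.0, sat_thresh=0.95):
    # Linearise the LDR image (undo display gamma) and scale it to the
    # target display peak, flagging saturated pixels for hallucination.
    linear = np.clip(ldr, 0.0, 1.0) ** gamma
    hdr = linear * peak
    saturated = ldr >= sat_thresh
    # A real method, like the paper's, would hallucinate plausible
    # detail inside `saturated`, e.g. from decoupled luminance/texture.
    return hdr, saturated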
H.264/AVC Backward Compatible Bit-Depth Scalable Video Coding
Vasco Nascimento, IST-IT, Portugal; João Ascenso, IST-IT, Portugal; Fernando Pereira, IST-IT, Portugal
As high dynamic range video gains popularity, video coding solutions able to efficiently provide both low and high dynamic range video, notably within a single bitstream, are increasingly important. While simulcasting can provide both dynamic ranges at the cost of some compression efficiency penalty, bit-depth scalable video coding offers a better trade-off between compression efficiency, adaptation flexibility and computational complexity. Considering the widespread use of H.264/AVC video, this paper proposes an H.264/AVC backward-compatible bit-depth scalable video coding solution offering a low dynamic range base layer and two high dynamic range enhancement layers with different qualities, at low complexity. Experimental results show that the proposed solution incurs an acceptable rate-distortion performance penalty relative to the HDR H.264/AVC single-layer coding solution.
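A minimal sketch of bit-depth inter-layer prediction, the mechanism such scalable schemes rely on; the plain left-shift predictor is an assumption for illustration, not the paper's actual inter-layer tools.

import numpy as np

def interlayer_predict(base8, shift=2):
    # Predict the higher-bit-depth layer from the decoded 8-bit base
    # layer by simple bit-depth upscaling (left shift).
    return base8.astype(np.int32) << shift

def enhancement_residual(hdr10, base8):
    # The enhancement layer only codes what inter-layer prediction
    # misses.
    return hdr10.astype(np.int32) - interlayer_predict(base8)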
Correlation Modeling for a Distributed Scalable Video Codec based on the HEVC Standard
Xiem Hoang Van, IST-IT, Portugal; João Ascenso, IST-IT, Portugal; Fernando Pereira, IST-IT, Portugal
The growing heterogeneity of networks, devices and consumption conditions calls for flexible and adaptive video coding solutions. The compression power of the HEVC standard and the benefits of the distributed video coding paradigm make it possible to design novel scalable coding solutions with improved error robustness and low encoding complexity while still achieving competitive compression efficiency. In this context, this paper proposes a novel scalable video coding scheme using an HEVC Intra compliant base layer and a distributed coding approach in the enhancement layers (EL). This design inherits the HEVC compression efficiency while providing low encoding complexity at the enhancement layers. The temporal correlation is exploited at the decoder to create the EL side information (SI) residue, an estimate of the original residue. The EL encoder sends only the data that cannot be inferred at the decoder, thus exploiting the correlation between the original and SI residues; however, this correlation must be characterized with an accurate correlation model to obtain coding efficiency improvements. Therefore, this paper proposes a correlation modeling solution to be used at both encoder and decoder, without requiring a feedback channel. Experimental results confirm that the proposed scalable coding scheme has lower encoding complexity and provides BD-Rate savings of up to 3.43% in comparison with the HEVC Intra scalable extension under development.
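Distributed coding schemes like this one typically model the correlation noise between the original and SI residues as Laplacian; the sketch below fits such a model, purely as an illustration of the concept (the paper's feedback-free estimation is necessarily different, since the decoder never sees the original residue).

import numpy as np

def laplacian_alpha(orig_residue, si_residue):
    # Fit the Laplacian parameter of the correlation noise
    # f(d) = (alpha / 2) * exp(-alpha * |d|), with alpha = sqrt(2)/sigma.
    diff = orig_residue.astype(np.float64) - si_residue.astype(np.float64)
    sigma = diff.std()
    return np.sqrt(2.0) / sigma if sigma > 0 else np.inf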

12:30 – 14:00

Lunch Break

14:00 – 15:00

Overview 3: 2D/3D AudioVisual Content Analysis & Description

Ioannis Pitas, Aristotle University of Thessaloniki, Greece

Chair: David Taubman, The University of New South Wales, Australia

15:00 – 16:00

Special Session: Quality Assessment

Chair: António Pinheiro, UBI, Portugal
Survey of Web-based Crowdsourcing Frameworks for Subjective Quality Assessment
Tobias Hossfeld, University of Würzburg, Germany; Matthias Hirth, University of Würzburg, Germany; Pavel Korshunov, EPFL, Switzerland; Philippe Hanhart, EPFL, Switzerland; Bruno Gardlo, FTW, Austria; Christian Keimel, TU Munich, Germany; Christian Timmerer, Multimedia Communication, Alpen-Adria-Universität Klagenfurt, Austria
The popularity of crowdsourcing for performing various tasks online has increased significantly in the past few years. Its low cost and flexibility have also attracted researchers in the field of subjective multimedia evaluation and Quality of Experience (QoE). Since online assessment of multimedia content is challenging, several dedicated frameworks have been created to aid in the design of tests, including support for testing methodologies such as ACR, DCR and PC, setting up the tasks, training sessions, screening of the subjects, and storage of the resulting data. In this paper, we evaluate web-based frameworks for multimedia quality assessment that support commonly used crowdsourcing platforms such as Amazon Mechanical Turk and Microworkers. The paper provides a detailed overview of these crowdsourcing frameworks and aims to help researchers in the field of QoE evaluation select and employ frameworks and crowdsourcing platforms adequate for their experiments.
Free-viewpoint video sequences: a new challenge for objective quality metrics
Philippe Hanhart, EPFL, Switzerland; Emilie Bosc, Université de Nantes, France; Patrick Le Callet, Université de Nantes, France; Touradj Ebrahimi, EPFL, Switzerland
Free-viewpoint television is expected to create a more natural and interactive viewing experience by allowing viewers to change the viewpoint of a 3D scene interactively. To render new virtual viewpoints, free-viewpoint systems rely on view synthesis. However, most objective metrics are known to fail at predicting the perceived quality of synthesized views. It is therefore legitimate to question the reliability of commonly used objective metrics for assessing the quality of free-viewpoint video (FVV) sequences. In this paper, we analyze the performance of several commonly used objective quality metrics on FVV sequences synthesized from decompressed depth data, using subjective scores as ground truth. Statistical analyses show that commonly used metrics are not reliable predictors of perceived image quality when different contents and distortions are considered. However, the correlation improves when individual conditions are considered, which indicates that the artifacts produced by some view synthesis algorithms might not be correctly handled by current metrics.
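The standard way to benchmark metrics against subjective ground truth, as done here, is via correlation coefficients; a minimal Python sketch with made-up numbers follows.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def metric_performance(objective_scores, mos):
    # PLCC measures prediction accuracy, SROCC measures monotonicity
    # against the subjective mean opinion scores (MOS).
    plcc, _ = pearsonr(objective_scores, mos)
    srocc, _ = spearmanr(objective_scores, mos)
    return plcc, srocc

# Hypothetical per-sequence scores, for illustration only:
psnr_scores = np.array([32.1, 35.4, 30.2, 38.0])
mos_scores = np.array([3.1, 4.0, 2.8, 4.5])
print(metric_performance(psnr_scores, mos_scores))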
Subjective and objective quality assessment of HDR images compressed with JPEG-XT
Claire Mantel, Technical University of Denmark, Denmark; Stefan Catalin Ferchiu, Technical University of Denmark, Denmark; Søren Forchhammer, Technical University of Denmark, Denmark
This paper presents a subjective test in which participants evaluated the quality of JPEG-XT compressed HDR images. Results show that, for the test images, subjective quality saturates starting around 3 bpp. Objective evaluations from three objective metrics dedicated to HDR content are compared with the subjective data, both in the physical domain and using a gamma correction to approximate perceptually uniform luminance coding. The MRSE metric obtains the best performance, with the limitation that it does not capture the quality saturation. Whether applying the gamma correction prior to the metrics helps depends on the characteristics of each objective metric.
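The gamma-corrected evaluation mentioned above can be sketched as follows; the peak luminance and gamma value are assumptions, and PSNR stands in for the paper's HDR-specific metrics.

import numpy as np

def gamma_corrected_psnr(ref_lum, test_lum, peak=4000.0, gamma=2.2):
    # Gamma-correct physical luminance values to approximate
    # perceptually uniform coding before computing the metric.
    ref = (np.clip(ref_lum, 0.0, peak) / peak) ** (1.0 / gamma)
    test = (np.clip(test_lum, 0.0, peak) / peak) ** (1.0 / gamma)
    mse = float(np.mean((ref - test) ** 2))
    return float("inf") if mse == 0.0 else 10.0 * np.log10(1.0 / mse)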

16:00 – 16:30

Afternoon Break

16:30 – 17:10

Special Session: Quality Assessment (Cont)

Chair: António Pinheiro, UBI, Portugal
Performance evaluation of the emerging JPEG XT image compression standard
António Pinheiro, UBI, Portugal; Karel Fliegel, CTU in Prague, Czech Republic; Pavel Korshunov, EPFL, Switzerland; Lukas Krasula, CTU in Prague, Czech Republic; Marco Bernardo, UBI, Portugal; Manuela Pereira, UBI, Portugal; Touradj Ebrahimi, EPFL, Switzerland
The upcoming JPEG XT standard is under development for High Dynamic Range (HDR) image compression. It encodes a Low Dynamic Range (LDR) version of the image, generated by a Tone Mapping Operator (TMO), with conventional JPEG coding as a base layer, and encodes the extra HDR information in a residual layer. This paper reports a study on the performance of the three profiles of JPEG XT (referred to as profiles A, B and C) using a test set of six HDR images. Moreover, four TMO techniques were used for base layer image generation to assess the influence of the TMO on the JPEG XT profiles. The HDR images were then coded with different quality levels for the base layer and for the residual layer. The performance of each profile was evaluated using the Signal to Noise Ratio (SNR), FSIM and Root Mean Square Error (RMSE) objective metrics, as well as the CIEDE2000 color difference. Profiles A and B present similar behavior: in both, quality tends to saturate at higher bit rates, whereas in profile C the SNR continues to increase at higher bit rates, for the considered set of parameters.
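The base-plus-residual structure common to all three profiles can be sketched as follows; the placeholder tone-mapping operators are assumptions for illustration, not the standard's actual transforms.

import numpy as np

def split_layers(hdr, tmo, inverse_tmo):
    # Base layer: tone-mapped LDR image (JPEG-coded in the standard).
    base = tmo(hdr)
    # Residual layer: ratio between the HDR image and the prediction
    # obtained by inverse tone mapping the base layer.
    residual = hdr / np.maximum(inverse_tmo(base), 1e-6)
    return base, residual

# Placeholder operators (illustrative assumptions):
tmo = lambda h: np.clip(h / h.max(), 0.0, 1.0) ** (1.0 / 2.2)
inverse_tmo = lambda b: (b ** 2.2) * 1000.0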
QoE-Driven Performance Analysis of Cloud Gaming Services
Zi-Yi Wen, National Chiao Tung University, Taiwan; Hsu-Feng Hsiao, National Chiao Tung University, Taiwan
With the popularity of cloud computing services and the endorsement of the video game industry, cloud gaming services have emerged promisingly. In a cloud gaming service, the contents of games can be delivered to clients through either video streaming or file streaming. Due to the strict constraint on end-to-end latency for real-time interaction in a game, there are still challenges in designing a successful cloud gaming system that delivers satisfying quality of experience to customers. In this paper, a methodology for the subjective and objective evaluation and analysis of cloud gaming services is developed. The methodology is based on a non-intrusive approach and can therefore be used on different kinds of cloud gaming systems. Objective measurement of important QoS factors is challenging because most commercial cloud gaming systems are proprietary and closed. In addition, satisfactory QoE is one of the crucial ingredients in the success of cloud gaming services. By combining subjective and objective evaluation results, cloud gaming system developers can infer the likely QoE level from the measured QoS factors. The methodology can also be used in an expert system for choosing the list of games that customers can enjoy in a given environment, as well as for deciding the upper bound on the number of simultaneous users in a system.

17:10 – 17:30

Closing Remarks

Chairs: Susanto Rahardja, National University of Singapore, Singapore and Zhengyou Zhang, Microsoft Research, USA