Dr Relja Arandjelović (Др Реља Аранђеловић)


This is a list of my selected publications; the full list is available on my Google Scholar profile.

Jump to year:
[2023] [2022] [2021] [2020] [2019] [2018] [2017] [2016] [2015] [2014] [2013] [2012] [2011] [2010] [2009]

2023

I. Balažević, D. Steiner, N. Parthasarathy, R. Arandjelović, O. J. Hénaff
Towards in-context scene understanding
Neural Information Processing Systems, 2023
| | | PDF | arXiv |
@InProceedings{Balazevic23,
  author       = "Bala\vzevi\'c, I. and Steiner, D. and Parthasarathy, N. and Arandjelovi\'c, R. and H\'enaff, O. J.",
  title        = "Towards in-context scene understanding",
  booktitle    = "Neural Information Processing Systems",
  year         = "2023",
}

In-context learning -- the ability to configure a model's behavior with different prompts -- has revolutionized the field of natural language processing, alleviating the need for task-specific models and paving the way for generalist models capable of assisting with any query. Computer vision, in contrast, has largely stayed in the former regime: specialized decoders and finetuning protocols are generally required to perform dense tasks such as semantic segmentation and depth estimation. In this work we explore a simple mechanism for in-context learning of such scene understanding tasks: nearest neighbor retrieval from a prompt of annotated features. We propose a new pretraining protocol -- leveraging attention within and across images -- which yields representations particularly useful in this regime. The resulting Hummingbird model, suitably prompted, performs various scene understanding tasks without modification while approaching the performance of specialists that have been finetuned for each task. Moreover, Hummingbird can be configured to perform new tasks much more efficiently than finetuned models, raising the possibility of scene understanding in the interactive assistant regime.
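
As an illustration of the retrieval mechanism described above, here is a minimal numpy sketch of labelling query patches by soft nearest-neighbour retrieval from a prompt of annotated features (the function name, shapes and temperature are illustrative assumptions, not the paper's code):

import numpy as np

def nn_scene_labels(query_feats, prompt_feats, prompt_labels, k=5, temperature=0.1):
    """Label query patches by soft nearest-neighbour retrieval from a prompt.

    query_feats:   (M, D) features of the query image patches.
    prompt_feats:  (N, D) features of annotated prompt patches.
    prompt_labels: (N, C) one-hot (or soft) per-patch annotations.
    Returns (M, C) predicted label distributions for the query patches.
    """
    # Cosine similarity between every query patch and every prompt patch.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    sim = q @ p.T                                    # (M, N)

    # Keep the k most similar prompt patches per query patch.
    topk = np.argsort(-sim, axis=1)[:, :k]           # (M, k)
    topk_sim = np.take_along_axis(sim, topk, axis=1)

    # Softmax-weighted vote over the retrieved annotations.
    w = np.exp(topk_sim / temperature)
    w /= w.sum(axis=1, keepdims=True)                # (M, k)
    return np.einsum('mk,mkc->mc', w, prompt_labels[topk])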


R. Arandjelović, A. Andonian, A. Mensch, O. J. Hénaff, J.-B. Alayrac, A. Zisserman
Three ways to improve feature alignment for open vocabulary detection
arXiv, 2023
| | | PDF | arXiv |
@Article{Arandjelovic23,
  author       = "Arandjelovi\'c, R. and Andonian, A. and Mensch, A. and H\'enaff, O. J. and Alayrac, J.-B. and Zisserman, A.",
  title        = "Three ways to improve feature alignment for open vocabulary detection",
  journal      = "CoRR",
  volume       = "abs/2303.13518",
  year         = "2023",
}

The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes.

We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs thus improving detection performance on classes with no human annotated bounding boxes.

Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
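
As an aside on the second contribution above, trainable gated shortcuts are commonly realised with a gate initialised to zero, so that the block reduces to the pretrained pathway when detection training starts; a minimal sketch under that assumption (the paper's exact parameterisation may differ):

import numpy as np

class GatedShortcut:
    """y = x + tanh(gate) * branch(x), with gate = 0 at initialisation,
    so the block is an identity mapping when training starts."""

    def __init__(self, branch):
        self.branch = branch          # e.g. a newly added FPN / head layer
        self.gate = np.float32(0.0)   # scalar gate; zero-initialised (would be trainable in a real model)

    def __call__(self, x):
        return x + np.tanh(self.gate) * self.branch(x)

# At initialisation the shortcut passes the pretrained features through unchanged:
shortcut = GatedShortcut(branch=lambda x: x * 2.0)
x = np.ones((4, 8), dtype=np.float32)
assert np.allclose(shortcut(x), x)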

2022

S. Koppula, Y. Li, A. Jaegle, E. Shelhamer, N. Parthasarathy, R. Arandjelović, J. Carreira, O. J. Hénaff
Where should I spend my FLOPS? Efficiency evaluations of visual pre-training methods
NeurIPS workshop on Self-Supervised Learning, 2022
| | | PDF | arXiv |
@InProceedings{Koppula22,
  author       = "Koppula, S. and Li, Y. and Jaegle, A. and Shelhamer, E. and Parthasarathy, N. and Arandjelovi\'c, R. and Carreira, J. and H\'enaff, O. J.",
  title        = "Where should I spend my {FLOPS}? {Efficiency} evaluations of visual pre-training methods",
  booktitle    = "NeurIPS workshop on Self-Supervised Learning",
  year         = "2022",
}

Self-supervised methods have achieved remarkable success in transfer learning, often achieving the same or better accuracy than supervised pre-training. Most prior work has done so by increasing pre-training computation by adding complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related, but orthogonal question: given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for obtaining high accuracy on representative visual tasks? Given the availability of large datasets, this setting is often more relevant for academic and industry labs alike. We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised). In a like-for-like fashion, we characterize their FLOP and CO2 footprints, relative to their accuracy when transferred to a canonical image segmentation task. Our analysis reveals strong disparities in the computational efficiency of pre-training methods and their dependence on dataset quality. In particular, our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data. We therefore advocate for (1) paying closer attention to dataset curation and (2) reporting accuracies in the context of the total computational cost.


O. J. Hénaff, S. Koppula, E. Shelhamer, D. Zoran, A. Jaegle, A. Zisserman, J. Carreira, R. Arandjelović
Object discovery and representation networks
European Conference on Computer Vision, 2022
| | | PDF | arXiv |
@InProceedings{Henaff22,
  author       = "H\'enaff, O. J. and Koppula, S. and Shelhamer, E. and Zoran, D. and Jaegle, A. and Zisserman, A. and Carreira, J. and Arandjelovi\'c, R.",
  title        = "Object discovery and representation networks",
  booktitle    = "European Conference on Computer Vision",
  year         = "2022",
}

The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that makes SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers the structure encoded in these priors by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and semantic segmentation on PASCAL and Cityscapes, while strongly surpassing supervised pre-training for video segmentation on DAVIS.


J. Carreira, S. Koppula, D. Zoran, A. Recasens, C. Ionescu, O. Hénaff, E. Shelhamer, R. Arandjelović, M. Botvinick, O. Vinyals, K. Simonyan, A. Zisserman, A. Jaegle
Hierarchical Perceiver
arXiv, 2022
| | | PDF | arXiv |
@Article{Carreira22,
  author       = "Carreira, J. and Koppula, S. and Zoran, D. and Recasens, A. and Ionescu, C. and H\'enaff, O. and Shelhamer, E. and Arandjelovi\'c, R. and Botvinick, M. and Vinyals, O. and Simonyan, K. and Zisserman, A. and Jaegle, A.",
  title        = "Hierarchical Perceiver",
  journal      = "CoRR",
  volume       = "abs/2202.10890",
  year         = "2022",
}

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by exclusively using global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, AudioSet and PASCAL VOC datasets.


W. Yifan, C. Doersch, R. Arandjelović, J. Carreira, A. Zisserman
Input-level Inductive Biases for 3D Reconstruction
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
| | | PDF | arXiv |
@InProceedings{Yifan22,
  author       = "Yifan, W. and Doersch, C. and Arandjelovi\'c, R. and Carreira, J. and Zisserman, A.",
  title        = "Input-level Inductive Biases for {3D} Reconstruction",
  booktitle    = "IEEE/CVF Conference on Computer Vision and Pattern Recognition",
  year         = "2022",
}

Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases. In this paper we tackle 3D reconstruction using a domain agnostic architecture and study how instead to inject the same type of inductive biases directly as extra inputs to the model. This approach makes it possible to apply existing general models, such as Perceivers, on this rich domain, without the need for architectural changes, while simultaneously maintaining data efficiency of bespoke models. In particular we study how to encode cameras, projective ray incidence and epipolar geometry as model inputs, and demonstrate competitive multi-view depth estimation performance on multiple benchmarks.
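
One concrete example of such an input-level encoding is attaching a per-pixel viewing-ray direction to the image; a rough sketch of that computation for a pinhole camera (a generic illustration, not necessarily the paper's exact featurisation):

import numpy as np

def pixel_ray_directions(K, R_c2w, height, width):
    """Per-pixel unit ray directions in world coordinates for a pinhole camera.

    K:      (3, 3) intrinsics.
    R_c2w:  (3, 3) camera-to-world rotation.
    Returns (height, width, 3) unit vectors that can be concatenated to the
    RGB input as three extra channels.
    """
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing='ij')
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=np.float64)], axis=-1)
    cam_dirs = pix @ np.linalg.inv(K).T    # back-project pixels into camera space
    world_dirs = cam_dirs @ R_c2w.T        # rotate into world coordinates
    return world_dirs / np.linalg.norm(world_dirs, axis=-1, keepdims=True)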

2021

R. Arandjelović, A. Zisserman
NeRF in detail: Learning to sample for view synthesis
arXiv, 2021
| | | PDF | arXiv |
@Article{Arandjelovic21,
  author       = "Arandjelovi\'c, R. and Zisserman, A.",
  title        = "{NeRF} in detail: {Learning} to sample for view synthesis",
  journal      = "CoRR",
  volume       = "abs/2106.05264",
  year         = "2021",
}

Neural radiance fields (NeRF) methods have demonstrated impressive novel view synthesis performance. The core approach is to render individual rays by querying a neural network at points sampled along the ray to obtain the density and colour of the sampled points, and integrating this information using the rendering equation. Since dense sampling is computationally prohibitive, a common solution is to perform coarse-to-fine sampling.

In this work we address a clear limitation of the vanilla coarse-to-fine approach -- that it is based on a heuristic and not trained end-to-end for the task at hand. We introduce a differentiable module that learns to propose samples and their importance for the fine network, and consider and compare multiple alternatives for its neural architecture. Training the proposal module from scratch can be unstable due to lack of supervision, so an effective pre-training strategy is also put forward. The approach, named `NeRF in detail' (NeRF-ID), achieves superior view synthesis quality over NeRF and the state-of-the-art on the synthetic Blender benchmark and on par or better performance on the real LLFF-NeRF scenes. Furthermore, by leveraging the predicted sample importance, a 25% saving in computation can be achieved without significantly sacrificing the rendering quality.
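
For reference, the rendering described above is usually implemented with the standard NeRF quadrature: for samples t_i along a ray with densities \sigma_i, colours \mathbf{c}_i and spacings \delta_i = t_{i+1} - t_i,

\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right).

The learned proposal module in NeRF-ID replaces the heuristic (roughly, inverse-CDF resampling of the coarse network's weights) that vanilla NeRF uses to choose the t_i for the fine pass.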

2020

J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, A. Zisserman
Self-Supervised MultiModal Versatile Networks
Neural Information Processing Systems, 2020
| | | PDF | arXiv | Code | Models: S3D-G, TSM-50, TSM-50x2 |
@InProceedings{Alayrac20,
  author       = "Alayrac, J.-B. and Recasens, A. and Schneider, R. and Arandjelovi\'c, R. and Ramapuram, J. and De~Fauw, J. and Smaira, L. and Dieleman, S. and Zisserman, A.",
  title        = "Self-Supervised MultiModal Versatile Networks",
  booktitle    = "Neural Information Processing Systems",
  year         = "2020",
}

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.


I. Rocco, R. Arandjelović, J. Sivic
Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions
European Conference on Computer Vision, 2020
| | | PDF | arXiv | Code and models |
@InProceedings{Rocco20,
  author       = "Rocco, I. and Arandjelovi\'c, R. and Sivic, J.",
  title        = "Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions",
  booktitle    = "European Conference on Computer Vision",
  year         = "2020",
}

In this work we target the problem of estimating accurately localised correspondences between a pair of images. We adopt the recent Neighbourhood Consensus Networks that have demonstrated promising performance for difficult correspondence problems and propose modifications to overcome their main limitations: large memory consumption, large inference time and poorly localised correspondences. Our proposed modifications can reduce the memory footprint and execution time more than 10×, with equivalent results. This is achieved by sparsifying the correlation tensor containing tentative matches, and its subsequent processing with a 4D CNN using submanifold sparse convolutions. Localisation accuracy is significantly improved by processing the input images in higher resolution, which is possible due to the reduced memory footprint, and by a novel two-stage correspondence relocalisation module. The proposed Sparse-NCNet method obtains state-of-the-art results on the HPatches Sequences and InLoc visual localisation benchmarks, and competitive results in the Aachen Day-Night benchmark.


I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, J. Sivic
NCNet: Neighbourhood Consensus Networks for Estimating Image Correspondences
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020
| | | PDF | Code and models |
@Article{Rocco20b,
  author       = "Rocco, I. and Cimpoi, M. and Arandjelovi\'c, R. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "{NCNet}: {Neighbourhood} Consensus Networks for Estimating Image Correspondences",
  journal      = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
  year         = "2020",
}

We address the problem of finding reliable dense correspondences between a pair of images. This is a challenging task due to strong appearance differences between the corresponding scene elements and ambiguities generated by repetitive patterns. The contributions of this work are threefold. First, inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model. Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences. Third, we show the proposed neighbourhood consensus network can be applied to a range of matching tasks including both category- and instance-level matching, obtaining the state-of-the-art results on the PF, TSS, InLoc and HPatches benchmarks.
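
For intuition, the 4D space of correspondences referred to above is simply the correlation of every spatial feature in one image with every spatial feature in the other; a minimal numpy sketch (shapes are illustrative, not the released code):

import numpy as np

def correlation_4d(feat_a, feat_b):
    """All-pairs feature correlation between two images.

    feat_a: (H, W, D) L2-normalised dense features of image A.
    feat_b: (H, W, D) L2-normalised dense features of image B.
    Returns c with c[i, j, k, l] = <feat_a[i, j], feat_b[k, l]>; neighbourhood
    consensus is then a learned 4D convolution over this tensor.
    """
    return np.einsum('ijd,kld->ijkl', feat_a, feat_b)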


Y. Zhong, R. Arandjelović, A. Zisserman
Compact Deep Aggregation for Set Retrieval
arXiv, 2020
| | | PDF | arXiv |
@Article{Zhong20,
  author       = "Zhong, Y. and Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Compact Deep Aggregation for Set Retrieval",
  journal      = "CoRR",
  volume       = "abs/2003.11794",
  year         = "2020",
}

The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem -- that of retrieving images containing multiple faces from a large scale dataset of images. Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities, all but one, etc.
To this end, we make the following contributions: first, we propose a CNN architecture -- SetNet -- to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed length descriptor designed for set retrieval, and the score of an image is a count of the number of identities that match the query; second, we show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that -- far exceeding a number of baselines; third, we explore the speed vs. retrieval quality trade-off for set retrieval using this compact descriptor; and, finally, we collect and annotate a large dataset of images containing various numbers of celebrities, which we use for evaluation and which is publicly released.

2019

J.-B. Alayrac, J. Carreira, R. Arandjelović, A. Zisserman
Controllable Attention for Structured Layered Video Decomposition
IEEE/CVF International Conference on Computer Vision, 2019
| | | PDF | arXiv |
@InProceedings{Alayrac19,
  author       = "Alayrac, J.-B. and Carreira, J. and Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Controllable Attention for Structured Layered Video Decomposition",
  booktitle    = "IEEE/CVF International Conference on Computer Vision",
  year         = "2019",
}

The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to: for example, to separate reflections, transparency or object motion.

We make the following three contributions: (i) we introduce a new structured neural network architecture that explicitly incorporates layers (as spatial masks) into its design. This improves separation performance over previous general purpose networks for this task; (ii) we demonstrate that we can augment the architecture to leverage external cues such as audio for controllability and to help disambiguation; and (iii) we experimentally demonstrate the effectiveness of our approach and training procedure with controlled experiments while also showing that the proposed model can be successfully applied to real-world applications such as reflection removal and action recognition in cluttered scenes.


S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelović, T. Mann, P. Kohli
Scalable Verified Training for Provably Robust Image Classification
IEEE/CVF International Conference on Computer Vision, 2019
| | | PDF | arXiv |
@InProceedings{Gowal19,
  author       = "Gowal, S. and Dvijotham, K. and Stanforth, R. and Bunel, R. and Qin, C. and Uesato, J. and Arandjelovi\'c, R. and Mann, T. and Kohli, P.",
  title        = "Scalable Verified Training for Provably Robust Image Classification",
  booktitle    = "IEEE/CVF International Conference on Computer Vision",
  year         = "2019",
}

Recent work has shown that it is possible to train deep neural networks that are provably robust to norm-bounded adversarial perturbations. Most of these methods are based on minimizing an upper bound on the worst-case loss over all possible adversarial perturbations. While these techniques show promise, they often result in difficult optimization procedures that remain hard to scale to larger networks. Through a comprehensive analysis, we show how a simple bounding technique, interval bound propagation (IBP), can be exploited to train large provably robust neural networks that beat the state-of-the-art in verified accuracy. While the upper bound computed by IBP can be quite weak for general networks, we demonstrate that an appropriate loss and clever hyper-parameter schedule allow the network to adapt such that the IBP bound is tight. This results in a fast and stable learning algorithm that outperforms more sophisticated methods and achieves state-of-the-art results on MNIST, CIFAR-10 and SVHN. It also allows us to train the largest model to be verified beyond vacuous bounds on a downscaled version of ImageNet.


R. Arandjelović, A. Zisserman
Object Discovery with a Copy-Pasting GAN
arXiv, 2019
| | | PDF | arXiv |
@Article{Arandjelovic19,
  author       = "Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Object Discovery with a Copy-Pasting {GAN}",
  journal      = "CoRR",
  volume       = "abs/1905.11369",
  year         = "2019",
}

We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever. A novel copy-pasting GAN framework is proposed, where the generator learns to discover an object in one image by compositing it into another image such that the discriminator cannot tell that the resulting image is fake. After carefully addressing subtle issues, such as preventing the generator from `cheating', this game results in the generator learning to select objects, as copy-pasting objects is most likely to fool the discriminator. The system is shown to work well on four very different datasets, including large object appearance variations in challenging cluttered backgrounds.
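
The compositing operation at the heart of the copy-pasting game is plain alpha blending with the generator's predicted soft mask; a minimal sketch under that assumption (the anti-'cheating' measures discussed in the paper are omitted):

import numpy as np

def copy_paste(src_img, dst_img, mask):
    """Composite the region selected by `mask` from src_img onto dst_img.

    src_img, dst_img: (H, W, 3) images in [0, 1].
    mask:             (H, W) soft segmentation in [0, 1] predicted by the
                      generator for src_img.
    The discriminator then has to decide whether the composite is a real image.
    """
    m = mask[..., None]
    return m * src_img + (1.0 - m) * dst_img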

2018

S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelović, T. Mann, P. Kohli
On the effectiveness of interval bound propagation for training verifiably robust models
NeurIPS workshop on Security in Machine Learning, 2018
* Best paper award, Oral presentation *
| | | PDF | arXiv | Code | Presentation (video) |
@InProceedings{Gowal18,
  author       = "Gowal, S. and Dvijotham, K. and Stanforth, R. and Bunel, R. and Qin, C. and Uesato, J. and Arandjelovi\'c, R. and Mann, T. and Kohli, P.",
  title        = "On the effectiveness of interval bound propagation for training verifiably robust models",
  booktitle    = "NeurIPS workshop on Security in Machine Learning",
  year         = "2018",
}

Recent works have shown that it is possible to train models that are verifiably robust to norm-bounded adversarial perturbations. While these recent methods show promise, they remain hard to scale and difficult to tune. This paper investigates how interval bound propagation (IBP) using simple interval arithmetic can be exploited to train verifiably robust neural networks that are surprisingly effective. While IBP itself has been studied in prior work, our contribution is in showing that, with an appropriate loss and careful tuning of hyper-parameters, verified training with IBP leads to a fast and stable learning algorithm. We compare our approach with recent techniques, and train classifiers that improve on the state-of-the-art in single-model adversarial robustness: we reduce the verified error rate from 3.67% to 2.23% on MNIST (with ℓ∞ perturbations of ε=0.1), from 19.32% to 8.05% on MNIST (at ε=0.3), and from 78.22% to 72.91% on CIFAR-10 (at ε=8/255).
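
The interval arithmetic underlying IBP is compact enough to sketch: an axis-aligned box of activations is propagated through an affine layer in centre/radius form and through monotonic activations elementwise (a minimal numpy illustration, not the released training code):

import numpy as np

def affine_bounds(lower, upper, W, b):
    """Propagate elementwise bounds through y = W x + b."""
    mu = (upper + lower) / 2.0          # centre of the input box
    r = (upper - lower) / 2.0           # radius of the input box
    mu_out = W @ mu + b
    r_out = np.abs(W) @ r               # worst case over the box
    return mu_out - r_out, mu_out + r_out

def relu_bounds(lower, upper):
    """ReLU is monotonic, so it can be applied to the bounds directly."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Bounds for an l-infinity ball of radius eps around an input x0, pushed through
# one layer; repeating this layer by layer bounds the network output, and the
# verified loss is computed from the resulting worst-case logits.
x0 = np.array([0.2, -0.1, 0.7]); eps = 0.1
W = np.array([[1.0, -2.0, 0.5], [0.3, 0.8, -1.0]]); b = np.array([0.1, -0.2])
l, u = relu_bounds(*affine_bounds(x0 - eps, x0 + eps, W, b))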


I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, J. Sivic
Neighbourhood Consensus Networks
Neural Information Processing Systems, 2018
* Spotlight presentation *
| | | PDF | arXiv | Code and models | Video |
@InProceedings{Rocco18b,
  author       = "Rocco, I. and Cimpoi, M. and Arandjelovi\'c, R. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "Neighbourhood Consensus Networks",
  booktitle    = "Neural Information Processing Systems",
  year         = "2018",
}

We address the problem of finding reliable dense correspondences between a pair of images. This is a challenging task due to strong appearance differences between the corresponding scene elements and ambiguities generated by repetitive patterns. The contributions of this work are threefold. First, inspired by the classic idea of disambiguating feature matches using semi-local constraints, we develop an end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model. Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs without the need for costly manual annotation of point to point correspondences. Third, we show the proposed neighbourhood consensus network can be applied to a range of matching tasks including both category- and instance-level matching, obtaining state-of-the-art results on the PF Pascal dataset and the InLoc indoor visual localization benchmark.


I. Rocco, R. Arandjelović, J. Sivic
Convolutional neural network architecture for geometric matching
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018
| | | PDF | Code and models |
@Article{Rocco18a,
  author       = "Rocco, I. and Arandjelovi\'c, R. and Sivic, J.",
  title        = "Convolutional neural network architecture for geometric matching",
  journal      = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
  year         = "2018",
}

We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine, homography or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and that our matching layer significantly increases generalization capabilities to never seen before images. Finally, we show that the same model can perform both instance-level and category-level matching giving state-of-the-art results on the challenging PF, TSS and Caltech-101 datasets.


R. Arandjelović, A. Zisserman
Objects that Sound
European Conference on Computer Vision, 2018
| | | PDF | arXiv | Video results | Blog post |
@InProceedings{Arandjelovic18,
  author       = "Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Objects that Sound",
  booktitle    = "European Conference on Computer Vision",
  year         = "2018",
}

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video.

To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.


K. Dvijotham, S. Gowal, R. Stanforth, R. Arandjelović, B. O'Donoghue, J. Uesato, P. Kohli
Training verified learners with learned verifiers
arXiv, 2018
| | | PDF | arXiv |
@Article{Dvijotham18,
  author       = "Dvijotham, K. and Gowal, S. and Stanforth, R. and Arandjelovi\'c, R. and O'Donoghue, B. and Uesato, J. and Kohli, P.",
  title        = "Training verified learners with learned verifiers",
  journal      = "CoRR",
  volume       = "abs/1805.10265",
  year         = "2018",
}

This paper proposes a new algorithmic framework, predictor-verifier training, to train neural networks that are verifiable, i.e., networks that provably satisfy some desired input-output properties. The key idea is to simultaneously train two networks: a predictor network that performs the task at hand, e.g., predicting labels given inputs, and a verifier network that computes a bound on how well the predictor satisfies the properties being verified. Both networks can be trained simultaneously to optimize a weighted combination of the standard data-fitting loss and a term that bounds the maximum violation of the property. Experiments show that not only is the predictor-verifier architecture able to train networks to achieve state of the art verified robustness to adversarial examples with much shorter training times (outperforming previous algorithms on small datasets like MNIST and SVHN), but it can also be scaled to produce the first known (to the best of our knowledge) verifiably robust networks for CIFAR-10.


I. Rocco, R. Arandjelović, J. Sivic
End-to-end weakly-supervised semantic alignment
IEEE Conference on Computer Vision and Pattern Recognition, 2018
| | | PDF | arXiv | Code and models |
@InProceedings{Rocco18,
  author       = "Rocco, I. and Arandjelovi\'c, R. and Sivic, J.",
  title        = "End-to-end weakly-supervised semantic alignment",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2018",
}

We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category. This is a challenging task due to large intra-class variation, changes in viewpoint and background clutter. We present the following three principal contributions. First, we develop a convolutional neural network architecture for semantic alignment that is trainable in an end-to-end manner from weak image-level supervision in the form of matching image pairs. The outcome is that parameters are learnt from rich appearance variation present in different but semantically related images without the need for tedious manual annotation of correspondences at training time. Second, the main component of this architecture is a differentiable soft inlier scoring module, inspired by the RANSAC inlier scoring procedure, that computes the quality of the alignment based on only geometrically consistent correspondences thereby reducing the effect of background clutter. Third, we demonstrate that the proposed approach achieves state-of-the-art performance on multiple standard benchmarks for semantic alignment.


Y. Zhong, R. Arandjelović, A. Zisserman
GhostVLAD for set-based face recognition
Asian Conference on Computer Vision, 2018
| | | PDF | arXiv |
@InProceedings{Zhong18a,
  author       = "Zhong, Y. and Arandjelovi\'c, R. and Zisserman, A.",
  title        = "{GhostVLAD} for set-based face recognition",
  booktitle    = "Asian Conference on Computer Vision",
  year         = "2018",
}

The objective of this paper is to learn a compact representation of image sets for template-based face recognition. We make the following contributions: first, we propose a network architecture which aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact fixed-length representation. This compact representation requires minimal memory storage and enables efficient similarity computation. Second, we propose a novel GhostVLAD layer that includes ghost clusters that do not contribute to the aggregation. We show that a quality weighting on the input faces emerges automatically such that informative images contribute more than those with low quality, and that the ghost clusters enhance the network's ability to deal with poor quality images. Third, we explore how input feature dimension, number of clusters and different training techniques affect the recognition performance. Given this analysis, we train a network that far exceeds the state-of-the-art on the IJB-B face recognition dataset. This is currently one of the most challenging public benchmarks, and we surpass the state-of-the-art on both the identification and verification protocols.
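
A rough numpy sketch of aggregation with ghost clusters: the soft assignment is computed over K real plus G ghost clusters, but residuals are aggregated only for the real ones, so descriptors assigned mostly to ghost clusters contribute little (shapes and parameterisation are illustrative assumptions, not the trained model):

import numpy as np

def ghostvlad(features, centres, assign_w, assign_b, num_ghost=1):
    """features: (N, D) face descriptors for one image set / template.
    centres:     (K, D) real cluster centres (ghost clusters have no centres).
    assign_w:    (D, K + num_ghost), assign_b: (K + num_ghost,) soft-assignment params.
    Returns a (K * D,) L2-normalised set descriptor."""
    logits = features @ assign_w + assign_b                       # (N, K + G)
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                             # softmax over K + G clusters
    a = a[:, :centres.shape[0]]                                   # drop ghost assignments
    resid = features[:, None, :] - centres[None, :, :]            # (N, K, D)
    vlad = (a[..., None] * resid).sum(axis=0)                     # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12   # intra-normalisation
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)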


Y. Zhong, R. Arandjelović, A. Zisserman
Compact Deep Aggregation for Set Retrieval
ECCV workshop on Compact and Efficient Feature Representation and Learning in Computer Vision, 2018
* Best paper award, Oral presentation *
| | | PDF | Dataset |
@InProceedings{Zhong18,
  author       = "Zhong, Y. and Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Compact Deep Aggregation for Set Retrieval",
  booktitle    = "ECCV workshop on Compact and Efficient Feature Representation and Learning in Computer Vision",
  year         = "2018",
}

The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem - that of retrieving images containing multiple faces from a large scale dataset of images. Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities, all but one, etc.
To this end, we make the following contributions: first, we propose a CNN architecture - SetNet - to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed length descriptor designed for set retrieval, and the score of an image is a count of the number of identities that match the query; second, we show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that - far exceeding a number of baselines; third, we explore the speed vs. retrieval quality trade-off for set retrieval using this compact descriptor; and, finally, we collect and annotate a large dataset of images containing various numbers of celebrities, which we use for evaluation and which will be publicly released.

2017

R. Arandjelović, A. Zisserman
Objects that Sound
arXiv, 2017
Published at ECCV 2018.
| | | PDF | arXiv | Video results |
@Article{Arandjelovic17b,
  author       = "Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Objects that Sound",
  journal      = "CoRR",
  volume       = "abs/1712.06651",
  year         = "2017",
}

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video.

To this end, we design new network architectures that can be trained using the AVC task for these functionalities: for cross-modal retrieval, and for localizing the source of a sound in an image. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.


I. Rocco, R. Arandjelović, J. Sivic
End-to-end weakly-supervised semantic alignment
arXiv, 2017
Published at CVPR 2018.
| | | PDF | arXiv | Code and models |
@Article{Rocco17a,
  author       = "Rocco, I. and Arandjelovi\'c, R. and Sivic, J.",
  title        = "End-to-end weakly-supervised semantic alignment",
  journal      = "CoRR",
  volume       = "abs/1712.06861",
  year         = "2017",
}

We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category. This is a challenging task due to large intra-class variation, changes in viewpoint and background clutter. We present the following three principal contributions. First, we develop a convolutional neural network architecture for semantic alignment that is trainable in an end-to-end manner from weak image-level supervision in the form of matching image pairs. The outcome is that parameters are learnt from rich appearance variation present in different but semantically related images without the need for tedious manual annotation of correspondences at training time. Second, the main component of this architecture is a differentiable soft inlier scoring module, inspired by the RANSAC inlier scoring procedure, that computes the quality of the alignment based on only geometrically consistent correspondences thereby reducing the effect of background clutter. Third, we demonstrate that the proposed approach achieves state-of-the-art performance on multiple standard benchmarks for semantic alignment.


R. Arandjelović, A. Zisserman
Look, Listen and Learn
IEEE International Conference on Computer Vision, 2017
| | | PDF | arXiv |
@InProceedings{Arandjelovic17,
  author       = "Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Look, Listen and Learn",
  booktitle    = "IEEE International Conference on Computer Vision",
  year         = "2017",
}

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
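
A minimal sketch of how the Audio-Visual Correspondence training signal can be constructed from unlabelled video: a frame and an audio clip drawn from the same point of the same video form a positive pair, and mismatched pairs form negatives (the sampling details here are illustrative, not the paper's exact procedure):

import numpy as np

def make_avc_batch(frames, audio_clips, rng):
    """Build a batch for the audio-visual correspondence task.

    frames:      (B, ...) numpy array of video frames, frames[i] from video i.
    audio_clips: (B, ...) numpy array of short audio clips, audio_clips[i]
                 taken from the same time in the same video as frames[i].
    rng:         np.random.Generator, e.g. np.random.default_rng(0).
    Returns (frame_batch, audio_batch, labels), where label 1 means the pair
    corresponds and label 0 means the audio was swapped in from another video.
    """
    B = len(frames)
    labels = rng.integers(0, 2, size=B)     # positives and negatives in roughly equal parts
    audio_idx = np.arange(B)
    for i in np.where(labels == 0)[0]:
        # Negative: pair the frame with audio from a different video.
        audio_idx[i] = rng.choice([j for j in range(B) if j != i])
    return frames, audio_clips[audio_idx], labels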


R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, J. Sivic
NetVLAD: CNN architecture for weakly supervised place recognition
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
| | | PDF | Project page | Presentation slides and video | Code and models |
@Article{Arandjelovic17a,
  author       = "Arandjelovi\'c, R. and Gronat, P. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "{NetVLAD}: {CNN} architecture for weakly supervised place recognition",
  journal      = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
  year         = "2017",
}

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following four principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture's parameters from images depicting the same places over time downloaded from Google Street View Time Machine. Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks. Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks.
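
In equation form, the NetVLAD layer aggregates N local descriptors x_i into a K x D matrix V by replacing VLAD's hard cluster assignment with a trainable soft assignment:

V(j, k) = \sum_{i=1}^{N} \frac{e^{\mathbf{w}_k^{\top} \mathbf{x}_i + b_k}}{\sum_{k'} e^{\mathbf{w}_{k'}^{\top} \mathbf{x}_i + b_{k'}}} \left( x_i(j) - c_k(j) \right),

where the assignment parameters {w_k, b_k} and the cluster centres {c_k} are all learnt. The soft assignment is what makes the layer differentiable, and hence trainable by backpropagation; V is then intra-normalised column-wise and L2-normalised as a whole to give the final compact descriptor.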


I. Rocco, R. Arandjelović, J. Sivic
Convolutional neural network architecture for geometric matching
IEEE Conference on Computer Vision and Pattern Recognition, 2017
| | | PDF | arXiv | Spotlight presentation (video) | Code and models |
@InProceedings{Rocco17,
  author       = "Rocco, I. and Arandjelovi\'c, R. and Sivic, J.",
  title        = "Convolutional neural network architecture for geometric matching",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2017",
}

We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate-spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and that our matching layer significantly increases generalization capabilities to never seen before images. Finally, we show that the same model can perform both instance-level and category-level matching giving state-of-the-art results on the challenging Proposal Flow dataset.


A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, T. Pajdla
24/7 place recognition by view synthesis
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
| | | PDF | Project page |
@Article{Torii17,
  author       = "Torii, A. and Arandjelovi\'c, R. and Sivic, J. and Okutomi, M. and Pajdla, T.",
  title        = "24/7 place recognition by view synthesis",
  journal      = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
  year         = "2017",
}

We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings built or destroyed. Such situations represent a major challenge for current large-scale place recognition methods. This work has the following three principal contributions. First, we demonstrate that matching across large changes in the scene appearance becomes much easier when both the query image and the database image depict the scene from approximately the same viewpoint. Second, based on this observation, we develop a new place recognition approach that combines (i) an efficient synthesis of novel views with (ii) a compact indexable image representation. Third, we introduce a new challenging dataset of 1,125 camera-phone query images of Tokyo that contain major changes in illumination (day, sunset, night) as well as structural changes in the scene. We demonstrate that the proposed approach significantly outperforms other large-scale place recognition techniques on this challenging data.

2016

Y. Zhong, R. Arandjelović, A. Zisserman
Faces in Places: Compound Query Retrieval
British Machine Vision Conference, 2016
| | | PDF | Project page | Dataset |
@InProceedings{Zhong16,
  author       = "Zhong, Y. and Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Faces in Places: Compound Query Retrieval",
  booktitle    = "British Machine Vision Conference",
  year         = "2016",
}

The goal of this work is to retrieve images containing both a target person and a target scene type from a large dataset of images. At run time this compound query is handled using a face classifier trained for the person, and an image classifier trained for the scene type. We make three contributions: first, we propose a hybrid convolutional neural network architecture that produces place-descriptors that are aware of faces and their corresponding descriptors. The network is trained to correctly classify a combination of face and scene classifier scores. Second, we propose an image synthesis system to render high quality fully-labelled face-and-place images, and train the network only from these synthetic images. Last, but not least, we collect and annotate a dataset of real images containing celebrities in different places, and use this dataset to evaluate the retrieval system. We demonstrate significantly improved retrieval performance for compound queries using the new face-aware place-descriptors.


A. Babenko, R. Arandjelović, V. Lempitsky
Pairwise Quantization
arXiv, 2016
| | | PDF | arXiv |
@Article{Babenko16,
  author       = "Babenko, A. and Arandjelovi\'c, R. and Lempitsky, V.",
  title        = "Pairwise Quantization",
  journal      = "CoRR",
  volume       = "abs/1606.01550",
  year         = "2016",
}

We consider the task of lossy compression of high-dimensional vectors through quantization. We propose an approach that learns quantization parameters by minimizing the distortion of scalar products and squared distances between pairs of points. This is in contrast to previous works that obtain these parameters through the minimization of the reconstruction error of individual points. The proposed approach proceeds by finding a linear transformation of the data that effectively reduces the minimization of the pairwise distortions to the minimization of individual reconstruction errors. After such a transformation, any of the previously-proposed quantization approaches can be used. Despite the simplicity of this transformation, the experiments demonstrate that it achieves considerable reduction of the pairwise distortions compared to applying quantization directly to the untransformed data.


R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, J. Sivic
NetVLAD: CNN architecture for weakly supervised place recognition
IEEE Conference on Computer Vision and Pattern Recognition, 2016
* Oral presentation *
| | | PDF | arXiv | Project page | Presentation slides and video | Code and models |
@InProceedings{Arandjelovic16,
  author       = "Arandjelovi\'c, R. and Gronat, P. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "{NetVLAD}: {CNN} architecture for weakly supervised place recognition",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2016",
}

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.

2015

R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, J. Sivic
NetVLAD: CNN architecture for weakly supervised place recognition
arXiv, 2015
Published at CVPR 2016.
| | | PDF | arXiv | Project page | Presentation slides and video | Code and models |
@Article{Arandjelovic15,
  author       = "Arandjelovi\'c, R. and Gronat, P. and Torii, A. and Pajdla, T. and Sivic, J.",
  title        = "{NetVLAD}: {CNN} architecture for weakly supervised place recognition",
  journal      = "CoRR",
  volume       = "abs/1511.07247",
  year         = "2015",
}

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.


A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, T. Pajdla
24/7 place recognition by view synthesis
IEEE Conference on Computer Vision and Pattern Recognition, 2015
| | | PDF | Project page |
@InProceedings{Torii15,
  author       = "Torii, A. and Arandjelovi\'c, R. and Sivic, J. and Okutomi, M. and Pajdla, T.",
  title        = "24/7 place recognition by view synthesis",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2015",
}

We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings built or destroyed. Such situations represent a major challenge for current large-scale place recognition methods. This work has the following three principal contributions. First, we demonstrate that matching across large changes in the scene appearance becomes much easier when both the query image and the database image depict the scene from approximately the same viewpoint. Second, based on this observation, we develop a new place recognition approach that combines (i) an efficient synthesis of novel views with (ii) a compact indexable image representation. Third, we introduce a new challenging dataset of 1,125 camera-phone query images of Tokyo that contain major changes in illumination (day, sunset, night) as well as structural changes in the scene. We demonstrate that the proposed approach significantly outperforms other large-scale place recognition techniques on this challenging data.


K. Chatfield, R. Arandjelović, O. M. Parkhi, A. Zisserman
On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets
International Journal of Multimedia Information Retrieval, 2015
| | | PDF |
@Article{Chatfield15,
  author       = "Chatfield, K. and Arandjelovi\'c, R. and Parkhi, O. M. and Zisserman, A.",
  title        = "On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets",
  journal      = "International Journal of Multimedia Information Retrieval",
  year         = "2015",
}

The objective of this work is to visually search large-scale video datasets for semantic entities specified by a text query. The paradigm we explore is constructing visual models for such semantic entities on-the-fly, i.e. at run time, by using an image search engine to source visual training data for the text query. The approach combines fast and accurate learning and retrieval, and enables videos to be returned within seconds of specifying a query. We describe three classes of queries, each with its associated visual search method: object instances (using a bag of visual words approach for matching); object categories (using a discriminative classifier for ranking key frames); and faces (using a discriminative classifier for ranking face tracks). We discuss the features suitable for each class of query, for example Fisher vectors or features derived from convolutional neural networks (CNNs), and how these choices impact on the trade-off between three important performance measures for a real-time system of this kind, namely: (1) accuracy, (2) memory footprint, and (3) speed. We also discuss and compare a number of important implementation issues, such as how to remove ‘outliers’ in the downloaded images efficiently, and how to best obtain a single descriptor for a face track. We also sketch the architecture of the real-time on-the-fly system. Quantitative results are given on a number of large-scale image and video benchmarks (e.g. TRECVID INS, MIRFLICKR-1M), and we further demonstrate the performance and real-world applicability of our methods over a dataset sourced from 10,000 h of unedited footage from BBC News, comprising 5M+ key frames.

2014

R. Arandjelović, A. Zisserman
Extremely low bit-rate nearest neighbor search using a Set Compression Tree
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014
| | | PDF |
@Article{Arandjelovic14b,
  author       = "Arandjelovi\'c, R. and Zisserman, A.",
  title        = "Extremely Low Bit-Rate Nearest Neighbor Search Using a Set Compression Tree",
  journal      = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
  year         = "2014",
}

The goal of this work is a data structure to support approximate nearest neighbor search on very large scale sets of vector descriptors. The criteria we wish to optimize are: (i) that the memory footprint of the representation should be very small (so that it fits into main memory); and (ii) that the approximation of the original vectors should be accurate.

We introduce a novel encoding method, named a Set Compression Tree (SCT), that satisfies these criteria. It is able to accurately compress 1 million descriptors using only a few bits per descriptor. The large compression rate is achieved by not compressing on a per-descriptor basis, but instead by compressing the set of descriptors jointly. We describe the encoding, decoding and use for nearest neighbor search, all of which are quite straightforward to implement.

The method, tested on standard benchmarks (SIFT1M and 80 Million Tiny Images), achieves superior performance to a number of state-of-the-art approaches, including Product Quantization, Locality Sensitive Hashing, Spectral Hashing, and Iterative Quantization. For example, SCT has a lower error using 5 bits than any of the other approaches, even when they use 16 or more bits per descriptor. We also include a comparison of all the above methods on the standard benchmarks.
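To put the quoted bit-rates in perspective, a back-of-the-envelope calculation using only the numbers stated above gives the memory footprint for one million descriptors:

n = 1_000_000                                  # descriptors, as in the experiments above
for bits in (5, 16):                           # SCT operating point vs. a 16-bit baseline
    print(bits, "bits/descriptor ->", round(n * bits / 8 / 2**20, 2), "MiB")
# roughly 0.6 MiB at 5 bits versus 1.9 MiB at 16 bits, before any index overhead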


R. Arandjelović, A. Zisserman
Visual vocabulary with a semantic twist
Asian Conference on Computer Vision, 2014
| | | PDF | Supplementary material | Poster | Code |
@InProceedings{Arandjelovic14,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Visual vocabulary with a semantic twist",
    booktitle    = "Asian Conference on Computer Vision",
    year         = "2014",
}

Successful large scale object instance retrieval systems are typically based on accurate matching of local descriptors, such as SIFT. However, these local descriptors are often not sufficiently distinctive to prevent false correspondences, as they only consider the gradient appearance of the local patch, without being able to "see the big picture".

We describe a method, SemanticSIFT, which takes account of local image semantic content (such as grass and sky) in matching, and thereby eliminates many false matches. We show that this enhanced descriptor can be employed in standard large scale inverted file systems with the following benefits: improved precision (as false retrievals are suppressed); an almost two-fold speedup in retrieval speed (as posting lists are shorter on average); and, depending on the target application, a 20% decrease in memory requirements (since unrequired 'semantic' words can be removed). Furthermore, we also introduce a fast, and near state of the art, semantic segmentation algorithm.

Quantitative and qualitative results on standard benchmark datasets (Oxford Buildings 5k and 105k) demonstrate the effectiveness of our approach.
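One simple way to picture how semantic content can be folded into a standard inverted file (an illustration only, not the exact construction used in the paper) is to key the index on (visual word, semantic label) pairs, so that two features match only when both components agree and each posting list covers a smaller subset of the database:

from collections import defaultdict

def build_semantic_index(features, num_semantic_classes):
    # features: iterable of (image_id, visual_word, semantic_label) triples
    index = defaultdict(list)
    for image_id, visual_word, semantic_label in features:
        key = visual_word * num_semantic_classes + semantic_label   # joint "semantic word"
        index[key].append(image_id)                                  # shorter posting lists
    return index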


R. Arandjelović, A. Zisserman
DisLocation: Scalable descriptor distinctiveness for location recognition
Asian Conference on Computer Vision, 2014
| | | PDF | Poster | Code for this and more by Torsten Sattler |
@InProceedings{Arandjelovic14a,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "{DisLocation}: {Scalable} descriptor distinctiveness for location recognition",
    booktitle    = "Asian Conference on Computer Vision",
    year         = "2014",
}

The objective of this paper is to improve large scale visual object retrieval for visual place recognition. Geo-localization based on a visual query is made difficult by the many non-distinctive features which commonly occur in imagery of urban environments, such as generic modern windows, doors, cars, trees, etc. The focus of this work is to adapt the standard Hamming Embedding retrieval system to account for varying descriptor distinctiveness. To this end, we propose a novel method for efficiently estimating the distinctiveness of all database descriptors, based on estimating the local descriptor density everywhere in descriptor space. In contrast to all competing methods, the (unsupervised) training time for our method (DisLoc) is linear in the number of database descriptors and takes only 100 seconds on a single CPU core for a 1 million image database. Furthermore, the added memory requirements are negligible (1%).

The method is evaluated on standard publicly available large-scale place recognition benchmarks containing street-view imagery of Pittsburgh and San Francisco. DisLoc is shown to outperform all baselines, while setting the new state-of-the-art on both benchmarks. The method is compatible with spatial reranking, which further improves recognition results.

Finally, we also demonstrate that 7% of the least distinctive features can be removed, therefore reducing storage requirements and improving retrieval speed, without any loss in place recognition accuracy.
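The density-based weighting idea can be illustrated with a brute-force sketch (hypothetical, and deliberately naive: the point of DisLoc is to obtain such estimates in time linear in the number of database descriptors, which the exact k-NN search below does not achieve):

from sklearn.neighbors import NearestNeighbors

def distinctiveness_weights(descriptors, k=10):
    # Use the radius to the k-th nearest neighbour as an inverse-density proxy:
    # a large radius means a sparse neighbourhood, i.e. a distinctive descriptor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(descriptors)
    dists, _ = nn.kneighbors(descriptors)       # column 0 is the point itself
    radius = dists[:, k]
    return radius / radius.mean()                # >1 for distinctive, <1 for common descriptors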


J. S. Chung, R. Arandjelović, G. Bergel, A. Franklin, A. Zisserman
Re-presentations of Art Collections
ECCV workshop on Computer Vision for ART Analysis, 2014
| | | PDF |
@InProceedings{Chung14,
    author       = "Chung, J.~S. and Arandjelovi\'c, R. and Bergel, G. and Franklin, A. and Zisserman, A.",
    title        = "Re-presentations of Art Collections",
    booktitle    = "Workshop on Computer Vision for Art Analysis, ECCV",
    year         = "2014",
}

The objective of this paper is to show how modern computer vision methods can be used to aid the art or book historian in analysing large digital art collections.

We make three contributions: first, we show that simple document processing methods in combination with accurate instance based retrieval methods can be used to automatically obtain all the illustrations from a collection of illustrated documents. Second, we show that image level descriptors can be used to automatically cluster collections of images based on their categories, and thereby represent a collection by its semantic content. Third, we show that instance matching can be used to identify illustrations from the same source, e.g. printed from the same woodblock, and thereby represent a collection in a manner suitable for temporal analysis of the printing process.

These contributions are demonstrated on a collection of illustrated English Ballad sheets.


T. Tommasi, R. Aly, K. McGuinness, K. Chatfield, R. Arandjelović, O. M. Parkhi, R. Ordelman, A. Zisserman, T. Tuytelaars
Beyond metadata: Searching your archive based on its audio-visual content
International Broadcasting Convention, 2014
| | | PDF |
@InProceedings{Tommasi14,
  author       = "Tomassi, T. and Aly, R. and McGuinness, K. and Chatfield, K. and Arandjelovi\'c, R. and Parkhi, O.~M. and Ordelman, R. and Zisserman, A. and Tuytelaars, T.",
  title        = "Beyond metadata: {Searching} your archive based on its audio-visual content",
  booktitle    = "International Broadcasting Convention",
  year         = "2014",
}

The EU FP7 project AXES aims at better understanding the needs of archive users and supporting them with systems that reach beyond the state-of-the-art. Our system allows users to instantaneously retrieve content using metadata, spoken words, or a vocabulary of reliably detected visual concepts comprising places, objects and events. Additionally, users can query for new concepts, for which models are learned on-the-fly, using training images obtained from an internet search engine. Thanks to advanced analysis and indexing methods, relevant material can be retrieved within seconds. Our system supports different types of models for object categories (e.g. "bus" or "house"), specific objects (landmarks or logos), person categories (e.g. "people with moustaches"), or specific persons (e.g. "President Obama"). In addition to text queries, we support query-by-example, which retrieves content containing the same location, objects, or faces shown in provided images. Finally, our system provides alternatives to query-based retrieval by allowing users to browse archives using generated links. Here we evaluate the precision of the retrieved results based on textual queries describing visual content, with the queries extracted from user testing query logs.

2013

R. Arandjelović, A. Zisserman
All about VLAD
IEEE Conference on Computer Vision and Pattern Recognition, 2013
| | | PDF |
@InProceedings{Arandjelovic13,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "All about {VLAD}",
    booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
    year         = "2013",
}

The objective of this paper is large scale object instance retrieval, given a query image. A starting point of such systems is feature detection and description, for example using SIFT. The focus of this paper, however, is towards very large scale retrieval where, due to storage requirements, very compact image descriptors are required and no information about the original SIFT descriptors can be accessed directly at run time.

We start from VLAD, the state-of-the-art compact descriptor introduced by Jegou et al. for this purpose, and make three novel contributions: first, we show that a simple change to the normalization method significantly improves retrieval performance; second, we show that vocabulary adaptation can substantially alleviate problems caused when images are added to the dataset after initial vocabulary learning. These two methods set a new state-of-the-art over all benchmarks investigated here for both mid-dimensional (20k-D to 30k-D) and small (128-D) descriptors.

Our third contribution is a multiple spatial VLAD representation, MultiVLAD, that allows the retrieval and localization of objects that only extend over a small part of an image (again without requiring use of the original image SIFT descriptors).
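The normalization change in question is intra-normalization: L2-normalizing each cluster's block of aggregated residuals before the final global L2 normalization, which suppresses bursty visual words. A minimal sketch, assuming descriptors and cluster centroids are given as NumPy arrays:

import numpy as np

def vlad_intra_normalized(descriptors, centroids):
    k, d = centroids.shape
    # Assign each local descriptor to its nearest cluster centre.
    assign = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((k, d))
    for desc, c in zip(descriptors, assign):
        v[c] += desc - centroids[c]                         # aggregate residuals per cluster
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12   # intra-normalization (per cluster)
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                  # final global L2 normalization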


R. Arandjelović, A. Zisserman
Extremely low bit-rate nearest neighbor search using a Set Compression Tree
Technical Report, Department of Engineering Science, University of Oxford, 2013
| | | PDF | Presentation |
@TechReport{Arandjelovic13a,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Extremely low bit-rate nearest neighbor search using a {S}et {C}ompression {T}ree",
    institution  = "Department of Engineering Science, University of Oxford",
    year         = "2013",
}

The goal of this work is a data structure to support approximate nearest neighbor search on very large scale sets of vector descriptors. The criteria we wish to optimize are: (i) that the memory footprint of the representation should be very small (so that it fits into main memory); and (ii) that the approximation of the original vectors should be accurate.

We introduce a novel encoding method, named a Set Compression Tree (SCT), that satisfies these criteria. It is able to compress 1 million descriptors using only a few bits per descriptor, and at high accuracy. The large compression rate is achieved by not compressing on a per-descriptor basis, but instead by compressing the set of descriptors jointly, i.e. if the set of descriptors is { x_1, x_2, ..., x_n } then the compressed set is com{ x_1, x_2, ..., x_n } rather than { com(x_1), com(x_2), ..., com(x_n) }. We describe the encoding, decoding and use for nearest neighbor search, all of which are quite straightforward to implement.

The method is compared on standard benchmarks (SIFT1M and 80 Million Tiny Images) to a number of state of the art approaches, including Product Quantization, Locality Sensitive Hashing, Spectral Hashing, and Iterative Quantization. In all cases SCT has superior performance. For example, SCT has a lower error using 5 bits than any of the other approaches, even when they use 16 or more bits per descriptor. We also include a comparison of all the above methods on the standard benchmarks.


R. Arandjelović
Advancing Large Scale Object Retrieval
PhD thesis from University of Oxford, 2013
| | | PDF |
@PhdThesis{Arandjelovic13b,
  author       = "Arandjelovi\'c, R.",
  title        = "Advancing Large Scale Object Retrieval",
  school       = "University of Oxford",
  year         = "2013",
}

The objective of this work is object retrieval in large scale image datasets, where the object is specified by an image query and retrieval should be immediate at run time. Such a system has a wide variety of applications including object or location recognition, video search, near duplicate detection and 3D reconstruction. The task is very challenging because of large variations in the imaged object appearance due to changes in lighting conditions, scale and viewpoint, as well as partial occlusions.

A starting point of established systems which tackle the same task is detection of viewpoint invariant features, which are then quantized into visual words and efficient retrieval is performed using an inverted index. We make the following three improvements to the standard framework: (i) a new method to compare SIFT descriptors (RootSIFT) which yields superior performance without increasing processing or storage requirements; (ii) a novel discriminative method for query expansion; (iii) a new feature augmentation method.

Scaling up to searching millions of images involves either distributing storage and computation across many computers, or employing very compact image representations on a single computer combined with memory-efficient approximate nearest neighbour search (ANN). We take the latter approach and improve VLAD, a popular compact image descriptor, using: (i) a new normalization method to alleviate the burstiness effect; (ii) vocabulary adaptation to reduce influence of using a bad visual vocabulary; (iii) extraction of multiple VLADs for retrieval and localization of small objects. We also propose a method, SCT, for extremely low bit-rate compression of descriptor sets in order to reduce the memory footprint of ANN.

The problem of finding images of an object in an unannotated image corpus starting from a textual query is also considered. Our approach is to first obtain multiple images of the queried object using textual Google image search, and then use these images to visually query the target database. We show that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object.

Current retrieval techniques work only for objects which have a light coating of texture, while failing completely for smooth (fairly textureless) objects best described by shape. We present a scalable approach to smooth object retrieval and illustrate it on sculptures. A smooth object is represented by its imaged shape using a set of quantized semi-local boundary descriptors (a bag-of-boundaries); the representation is suited to the standard visual word based object retrieval. Furthermore, we describe a method for automatically determining the title and sculptor of an imaged sculpture using the proposed smooth object retrieval system.


G. Bergel, A. Franklin, M. Heaney, R. Arandjelović, A. Zisserman, D. Funke
Content-Based Image-Recognition on Printed Broadside Ballads: The Bodleian Libraries' ImageMatch Tool
IFLA World Library and Information Congress, 2013
| | | PDF |
@InProceedings{Bergel13,
    author       = "Bergel, G. and Franklin, A. and Heaney, M. and Arandjelovi\'c, R. and Zisserman, A. and Funke, D.",
    title        = "Content-Based Image-Recognition on Printed {Broadside Ballads}: {The Bodleian Libraries' ImageMatch} Tool",
    booktitle    = "IFLA World Library and Information Congress",
    year         = "2013",
}

This paper introduces the Bodleian Ballads ImageMatch tool, developed by the Visual Geometry Group of the University of Oxford's Department of Engineering Science on behalf of the Bodleian Libraries. ImageMatch was designed to assist with the cataloguing and study of the pictorial content of early British printed broadside ballads, but has potential value for many other kinds of printed material. The paper outlines the nature of the materials to which ImageMatch has been applied; describes how the tool works and what it can do; and offers some discussion of the benefits of ImageMatch for image-cataloguing in Rare Books collections.


R. Aly et al.
The AXES submissions at TrecVid 2013
TRECVid Workshop, 2013
| | | PDF |
@InProceedings{Aly13,
  author       = "Aly, R. and Arandjelovi\'c, R. and Chatfield, K. and Douze, M. and Fernando, B. and Harchaoui, Z. and McGuinness, K. and O'Connor, N.~E. and Oneata, D. and Parkhi, O.~M. and Potapov, D. and Revaud, J. and Schmid, C. and Schwenninger, J. and Scott, D. and Tuytelaars, T. and Verbeek, J. and Wang, H. and Zisserman, A.",
  title        = "The {AXES} submissions at {TrecVid} 2013",
  booktitle    = "TRECVid Workshop",
  year         = "2013",
}

The AXES project participated in the interactive instance search task (INS), the semantic indexing task (SIN), the multimedia event detection task (MED) and the multimedia event recounting task (MER) for TRECVid 2013. Our interactive INS system focused this year on using classifiers trained at query time with positive examples collected from external search engines. Our INS experiments were carried out by students and researchers at Dublin City University. Our best INS runs performed on par with the top-ranked INS runs in terms of P@10 and P@30, and around the median in terms of mAP.

For SIN, MED and MER, we use state-of-the-art low-level descriptors for motion, image and sound, as well as high-level features for speech and text. The low-level descriptors are aggregated with Fisher vectors into high-dimensional video-level signatures and the high-level features are aggregated into bag-of-words histograms. Given these features we train linear classifiers, and use early and late fusion to combine the different features. Our MED system achieved the best score of all submitted runs in the main track as well as in the ad-hoc track.

This paper describes in detail our INS, SIN, MED and MER systems and the results and findings of our experiments.

2012

R. Arandjelović, A. Zisserman
Three things everyone should know to improve object retrieval
IEEE Conference on Computer Vision and Pattern Recognition, 2012
| | | PDF | Poster | Presentation |
@InProceedings{Arandjelovic12,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Three things everyone should know to improve object retrieval",
    booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
    year         = "2012",
}

The objective of this work is object retrieval in large scale image datasets, where the object is specified by an image query and retrieval should be immediate at run time in the manner of Video Google.

We make the following three contributions: (i) a new method to compare SIFT descriptors (RootSIFT) which yields superior performance without increasing processing or storage requirements; (ii) a novel method for query expansion where a richer model for the query is learnt discriminatively in a form suited to immediate retrieval through efficient use of the inverted index; (iii) an improvement of the image augmentation method proposed by Turcot and Lowe, where only the augmenting features which are spatially consistent with the augmented image are kept.

We evaluate these three methods over a number of standard benchmark datasets (Oxford Buildings 5k and 105k, and Paris 6k) and demonstrate substantial improvements in retrieval performance whilst maintaining immediate retrieval speeds. Combining these complementary methods achieves a new state-of-the-art performance on these datasets.
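RootSIFT, the first of the three contributions, is compact enough to state in full as it is usually implemented: L1-normalize each SIFT descriptor and take the element-wise square root, so that Euclidean distance between the transformed vectors corresponds to the Hellinger kernel on the originals.

import numpy as np

def root_sift(sift_descriptors):
    # sift_descriptors: array of shape (N, 128) with non-negative SIFT entries
    x = np.array(sift_descriptors, dtype=np.float64)
    x /= x.sum(axis=-1, keepdims=True) + 1e-12   # L1 normalization
    return np.sqrt(x)                            # element-wise square root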


R. Arandjelović, A. Zisserman
Name that Sculpture
ACM International Conference on Multimedia Retrieval, 2012
* Best paper candidate, Oral presentation *
| | | PDF | Presentation | Project Page |
@InProceedings{Arandjelovic12a,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Name that Sculpture",
    booktitle    = "ACM International Conference on Multimedia Retrieval",
    year         = "2012",
}

We describe a retrieval based method for automatically determining the title and sculptor of an imaged sculpture. This is a useful problem to solve, but also quite challenging given the variety in both form and material that sculptures can take, and the similarity in both appearance and names that can occur.

Our approach is to first visually match the sculpture and then to name it by harnessing the meta-data provided by Flickr users. To this end we make the following three contributions: (i) we show that using two complementary visual retrieval methods (one based on visual words, the other on boundaries) improves both retrieval and precision performance; (ii) we show that a simple voting scheme on the tf-idf weighted meta-data can correctly hypothesize a sub-set of the sculpture name (provided that the meta-data has first been suitably cleaned up and normalized); and (iii) we show that Google image search can be used to query expand the name sub-set, and thereby correctly determine the full name of the sculpture.

The method is demonstrated on over 500 sculptors covering more than 2000 sculptures. We also quantitatively evaluate the system and demonstrate correct identification of the sculpture on over 60% of the queries.
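A toy version of the tf-idf weighted vote over the meta-data of the visually matched images (an illustration of the idea rather than the paper's exact scheme; the cleaning and normalization of the Flickr text is assumed to have happened already):

import math
from collections import Counter

def vote_name_terms(matched_image_terms, document_frequency, num_documents, top=5):
    # matched_image_terms: one list of cleaned title/tag terms per retrieved image
    votes = Counter()
    for terms in matched_image_terms:
        for term in set(terms):                   # each matched image votes once per term
            idf = math.log(num_documents / (1 + document_frequency.get(term, 0)))
            votes[term] += idf                    # rare terms carry more weight
    return votes.most_common(top)                 # candidate words of the sculpture name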


R. Arandjelović, A. Zisserman
Multiple queries for large scale specific object retrieval
British Machine Vision Conference, 2012
| | | PDF | Poster |
@InProceedings{Arandjelovic12b,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Multiple queries for large scale specific object retrieval",
    booktitle    = "British Machine Vision Conference",
    year         = "2012",
}

The aim of large scale specific-object image retrieval systems is to instantaneously find images that contain the query object in the image database. Current systems, for example Google Goggles, concentrate on querying using a single view of an object, e.g. a photo a user takes with his mobile phone, in order to answer the question 'what is this?'. Here we consider the somewhat converse problem of finding all images of an object given that the user knows what he is looking for; so the input modality is text, not an image. This problem is useful in a number of settings, for example media production teams are interested in searching internal databases for images or video footage to accompany news reports and newspaper articles.

Given a textual query (e.g. 'coca cola bottle'), our approach is to first obtain multiple images of the queried object using textual Google image search. These images are then used to visually query the target database to discover images containing the object of interest. We compare a number of different methods for combining the multiple query images, including discriminative learning. We show that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object.

The system is evaluated quantitatively on the standard Oxford Buildings benchmark dataset where it achieves very high retrieval performance, and also qualitatively on the TrecVid 2011 known-item search dataset.
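Two natural combination rules, max- and mean-pooling of the per-database-image scores from each query image, can be sketched as follows; these are examples of simple alternatives to the discriminatively learned combination mentioned above, not necessarily the exact variants evaluated in the paper.

import numpy as np

def combine_query_scores(score_matrix, rule="max"):
    # score_matrix: shape (num_query_images, num_database_images)
    score_matrix = np.asarray(score_matrix)
    pooled = score_matrix.max(axis=0) if rule == "max" else score_matrix.mean(axis=0)
    return np.argsort(-pooled)          # database images ranked by the pooled score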


R. Aly et al.
AXES at TRECVid 2012: KIS, INS, and MED
TRECVid Workshop, 2012
| | | PDF |
@InProceedings{Aly12,
  author       = "Aly, R. and McGuinness, K. and Chen, S. and O'Connor, N.~E. and Chatfield, K. and Parkhi, O.~M. and Arandjelovi\'c, R. and Zisserman, A. and Fernando, B. and Tuytelaars, T. and Schwenninger, J. and Oneata, D. and Douze, M. and Revaud, J. and Potapov, D. and Wang, H. and Harchaoui, Z. and Verbeek, J. and Schmid, C.",
  title        = "{AXES} at {TRECVid} 2012: {KIS}, {INS}, and {MED}",
  booktitle    = "TRECVid Workshop",
  year         = "2012",
}

The AXES project participated in the interactive instance search task (INS), the known-item search task (KIS), and the multimedia event detection task (MED) for TRECVid 2012. As in our TRECVid 2011 system, we used nearly identical search systems and user interfaces for both INS and KIS. Our interactive INS and KIS systems focused this year on using classifiers trained at query time with positive examples collected from external search engines. Participants in our KIS experiments were media professionals from the BBC; our INS experiments were carried out by students and researchers at Dublin City University. We performed comparatively well in both experiments. Our best KIS run found 13 of the 25 topics, and our best INS runs outperformed all other submitted runs in terms of P@100. For MED, the system presented was based on a minimal number of low-level descriptors, which we chose to be as large as computationally feasible. These descriptors are aggregated to produce high-dimensional video-level signatures, which are used to train a set of linear classifiers. Our MED system achieved the second-best score of all submitted runs in the main track, and best score in the ad-hoc track, suggesting that a simple system based on state-of-the-art low-level descriptors can give relatively high performance. This paper describes in detail our KIS, INS, and MED systems and the results and findings of our experiments.

2011

R. Arandjelović, A. Zisserman
Smooth Object Retrieval using a Bag of Boundaries
IEEE International Conference on Computer Vision, 2011
* Most remembered poster, Spotlight presentation *
| | | PDF | Poster | Project Page | Sculptures 6k dataset |
@InProceedings{Arandjelovic11,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Smooth Object Retrieval using a Bag of Boundaries",
    booktitle    = "International Conference on Computer Vision",
    year         = "2011",
}

We describe a scalable approach to 3D smooth object retrieval which searches for and localizes all the occurrences of a user outlined object in a dataset of images in real time. The approach is illustrated on sculptures.

A smooth object is represented by its material appearance (sufficient for foreground/background segmentation) and imaged shape (using a set of semi-local boundary descriptors). The descriptors are tolerant to scale changes, segmentation failures, and limited viewpoint changes. Furthermore, we show that the descriptors may be vector quantized (into a bag-of-boundaries) giving a representation that is suited to the standard visual word architectures for immediate retrieval of specific objects.

We introduce a new dataset of 6K images containing sculptures by Moore and Rodin, and annotated with ground truth for the occurrence of twenty 3D sculptures. It is demonstrated that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion, and also that instances of the same shape can be retrieved even though they may be made of different materials.


R. Arandjelović, T. M. Sezgin
Sketch recognition by fusion of temporal and image-based features
Pattern Recognition, 2011
| | PDF |
@Article{Arandjelovic11_sketch,
    author       = "Arandjelovi\'c, R. and Sezgin, T.~M.",
    title        = "Sketch recognition by fusion of temporal and image-based features",
    journal      = "Pattern Recognition",
    volume       = "44",
    number       = "6",
    pages        = "1225--1234",
    year         = "2011",
}

K. McGuinness et al.
AXES at TRECVid 2011
TRECVid Workshop, 2011
| | | PDF |
@InProceedings{McGuinness11,
  author       = "McGuinness, K. and Aly, R. and Chen, S. and Frappier, M. and Kleppe, M. and Lee, H. and Ordelman, R. and Arandjelovi\'c, R. and Juneja, M. and Jawahar, C.~V. and Vedaldi, A. and Schwenninger, J. and Tschopel, S. and Schneider, D. and O'Connor, N.~E. and Zisserman, A. and Smeaton, A. and Beunders, H.",
  title        = "{AXES} at {TRECVid} 2011",
  booktitle    = "TRECVid Workshop",
  year         = "2011",
}

The AXES project participated in the interactive known-item search task (KIS) and the interactive instance search task (INS) for TRECVid 2011. We used the same system architecture and a nearly identical user interface for both the KIS and INS tasks. Both systems made use of text search on ASR, visual concept detectors, and visual similarity search. The user experiments were carried out with media professionals and media students at the Netherlands Institute for Sound and Vision, with media professionals performing the KIS task and media students participating in the INS task. This paper describes the results and findings of our experiments.

2010

R. Arandjelović, A. Zisserman
Efficient Image Retrieval for 3D Structures
British Machine Vision Conference, 2010
| | | PDF | Report |
@InProceedings{Arandjelovic10,
    author       = "Arandjelovi\'c, R. and Zisserman, A.",
    title        = "Efficient Image Retrieval for {3D} Structures",
    booktitle    = "British Machine Vision Conference",
    year         = "2010",
}

Large scale image retrieval systems for specific objects generally employ visual words together with a ranking based on a geometric relation between the query and target images. Previous work has used planar homographies for this geometric relation. Here we replace the planar transformation by epipolar geometry in order to improve the retrieval performance for 3D structures.

To this end, we introduce a new minimal solution for computing the affine fundamental matrix. The solution requires only two corresponding elliptical regions. Unlike previous approaches it does not require the rotation of the image patches, and ensures that the necessary epipolar tangency constraints are satisfied.

The solution is well suited for real time reranking in large scale image retrieval, since (i) elliptical correspondences are readily available from the affine region detections, and (ii) the use of only two region correspondences is very efficient in a RANSAC framework where the number of samples required grows exponentially with sample size. We demonstrate a gain in computational efficiency (over other methods of solution) without a loss in quality of the estimated epipolar geometry.
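The efficiency claim can be made concrete with the standard RANSAC trial-count bound: to find at least one all-inlier sample with probability p when the inlier ratio is w and the sample size is s, one needs about log(1-p)/log(1-w^s) trials, which grows rapidly with s.

import math

def ransac_trials(inlier_ratio, sample_size, confidence=0.99):
    return math.ceil(math.log(1 - confidence) /
                     math.log(1 - inlier_ratio ** sample_size))

# At a 30% inlier ratio: 2-element samples need 49 trials,
# while 4-element samples already need 567.
print(ransac_trials(0.30, 2), ransac_trials(0.30, 4))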

We present a quantitative performance evaluation on the Oxford and Paris image retrieval benchmarks, and demonstrate that retrieval of 3D structures is indeed improved.

2009

A. Blessing, T. M. Sezgin, R. Arandjelović, P. Robinson
A Multimodal Interface for Road Design
IUI Workshop on Sketch Recognition, 2009
| | PDF |
@InProceedings{Blessing09,
    author      = "Blessing, A. and Sezgin, T.~M. and Arandjelovi{\'c}, R. and Robinson, P.",
    title       = "A Multimodal Interface for Road Design",
    booktitle   = "IUI Workshop on Sketch Recognition",
    year        = "2009",
}