Landmark recognition – development and experiences – novatec blog

Landmark Recognition Challenge – Development and Experiences

In this post, I want to share my experiences from the Kaggle Landmark Recognition Challenge. You will learn how to develop a machine learning model that recognizes landmarks in images out of 15k classes. This should help you with Kaggle competitions like Landmark Recognition in the future. I summarize my lessons learned at the end of this article. You can find the code in this repository.

The Landmark Recognition Challenge

So far, image classification challenges (for example the ImageNet Large Scale Visual Recognition Challenge) were kept comparatively simple, with a small number of classes and a lot of training examples per class. Landmark Recognition sets new standards in image classification with the largest worldwide dataset to date. In addition, some classes have only one or a few training images. At first this simply sounds more difficult than older challenges. But there are two further snags, which take the Landmark Recognition Challenge to the next level:

  • There are test images with no landmark.
  • There are test images with more than one landmark.

Furthermore, it is better to predict no landmark for an image if the prediction score is too low. I wrote my classifications into a submission CSV file. This file should contain a header followed by one row per test image.
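As a rough sketch of writing such a submission file: the snippet below assumes a header of `id,landmarks` and rows of the form `<image id>,<landmark id> <score>`, with an empty second column when no landmark is predicted. This layout, the function name `write_submission` and the sample ids are assumptions for illustration, so check them against the competition's sample submission.

```python
import csv
import io


def write_submission(predictions, out):
    """Write predictions as CSV.

    predictions: dict mapping image id -> (landmark_id, score), or None
    when no landmark should be predicted for that image.
    The 'id,landmarks' layout is an assumption, not taken from the post.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "landmarks"])
    for image_id, prediction in predictions.items():
        if prediction is None:
            # Leave the landmarks column empty: "no landmark in this image".
            writer.writerow([image_id, ""])
        else:
            landmark_id, score = prediction
            writer.writerow([image_id, f"{landmark_id} {score:.4f}"])


# Hypothetical example data, written to an in-memory buffer.
buffer = io.StringIO()
write_submission({"img_a": (8815, 0.93), "img_b": None}, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing an open file handle instead of the `StringIO` buffer writes the same content to disk.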

Submissions are evaluated based on the Global Average Precision (GAP), also called micro average precision (microAP). The final ranking (in Kaggler’s language: “Leaderboard”) is created on the basis of the GAP value. If you would like to know more about the GAP metric, click here.
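To make the metric more tangible, here is a minimal sketch of GAP as it is commonly defined: all predictions from all test images are sorted by confidence, the precision at each rank is summed up for the correct predictions, and the sum is divided by the number of test images that actually show a landmark. The function name and the toy data are illustrative assumptions, not the official scoring code.

```python
def global_average_precision(predictions, num_queries_with_landmark):
    """Compute GAP (micro average precision).

    predictions: list of (confidence, is_correct) tuples, one entry per
    prediction that was submitted (images without a prediction are omitted).
    num_queries_with_landmark: number of test images that really show a landmark.
    """
    if num_queries_with_landmark == 0:
        return 0.0
    # Rank all predictions globally, most confident first.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    correct_so_far = 0
    precision_sum = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct_so_far += 1
            # Precision at this rank only counts for correct predictions.
            precision_sum += correct_so_far / rank
    return precision_sum / num_queries_with_landmark


# Toy example: three predictions, three images that show a landmark.
gap = global_average_precision([(0.9, True), (0.8, False), (0.7, True)], 3)
print(round(gap, 4))
```

Because the ranking is global, a confident wrong prediction on an image without a landmark lowers the precision of every prediction ranked below it, which is why leaving low-score images empty can pay off.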

Data preprocessing

For this competition I downloaded the following two datasets:

  • train.csv: contains 1,225,029 CSV rows with URLs to train images labeled with their associated landmark
  • test.csv: contains 117,703 CSV rows with URLs to test images

As mentioned above, the test images may show no landmark, one landmark or more than one landmark. The training images each depict one landmark. Each row of the train dataset contains a unique id, the URL and the labeled landmark id.

Train dataset with id, URL and landmark_id

In the test dataset each image contains a unique id and the image URL.

Test dataset with id and URL

As you can see, I have to download the images first, before I can use them. In the first trial I downloaded the images in full resolution (over 350 GB of data). That was too much for my setup, so I decided to resize them. I stored them at a resolution of 128 x 128 pixels, which reduced the size to 15 GB.

The train and test images are now available in the same size. This completes the preprocessing step for the test data. For the train data, however, each image still has to be associated with its label. Therefore I created a script which creates a directory for each landmark class and moves the related images into these directories. Now we have 15,000 folders, one for each label.
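The folder layout described above can be sketched as follows. This is not the script from the repository, only a minimal illustration. It assumes that every downloaded image was saved as `<id>.jpg` and that train.csv has the columns `id`, `url` and `landmark_id`. The function name and the demo data are made up.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def sort_into_class_folders(train_csv, image_dir, out_dir):
    """Move each downloaded image into a folder named after its landmark id."""
    moved = 0
    with open(train_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            source = Path(image_dir) / f"{row['id']}.jpg"  # assumed file naming
            if not source.exists():
                continue  # the download may have failed for this URL
            target_dir = Path(out_dir) / row["landmark_id"]
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target_dir / source.name))
            moved += 1
    return moved


# Small self-contained demo with fake image files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    images = tmp / "images"
    images.mkdir()
    (images / "a1.jpg").write_bytes(b"fake")
    (images / "b2.jpg").write_bytes(b"fake")
    train_csv = tmp / "train.csv"
    train_csv.write_text(
        "id,url,landmark_id\n"
        "a1,http://example.com/a1,42\n"
        "b2,http://example.com/b2,7\n"
        "c3,http://example.com/c3,42\n"  # no file on disk, so it is skipped
    )
    moved_count = sort_into_class_folders(train_csv, images, tmp / "train")
    sorted_files = sorted(
        p.relative_to(tmp / "train").as_posix() for p in (tmp / "train").rglob("*.jpg")
    )
    print(moved_count, sorted_files)
```

A one-folder-per-class layout like this is what directory-based image loaders such as the Keras `ImageDataGenerator` expect when they infer labels from folder names.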

Modeling with Keras and Transfer Learning

For my landmark recognition model I decided to classify the images using a Convolutional Neural Network (in this case: VGG16) pre-trained on the ImageNet dataset. The Keras library includes different types of CNN architectures like the VGG16 used here. Two other types would be, for example, ResNet or Inception. The pre-trained models can recognize 1,000 different categories of everyday things, such as species of dogs, cats, various types of household objects, vehicle types and so on. The ImageNet dataset includes all these everyday objects. In order to classify landmarks, we apply transfer learning strategies, namely feature extraction and fine-tuning, on top of the network pre-trained on ImageNet. During training we freeze the first 15 layers of the CNN and train only the remaining ones, so the network can learn about images outside the ImageNet dataset.

VGG16 architecture (Source:

The network architecture of VGG16 was introduced by Simonyan and Zisserman in 2014 (paper).

The ‘16’ means that the CNN has 16 weight layers. By the way, there is also a VGG with 19 layers, which is of course called VGG19. The default input size for this model is 224 x 224. However, I reduced the resolution to 128 x 128 pixels for better runtime performance.

Bottleneck features

Now let’s jump into the code and see which steps are needed. First, I run the images once through the convolutional base of the VGG16 and save the resulting bottleneck features (i.e., the last activation maps before the fully-connected layers in the original model). After that I train a small fully-connected network on the bottleneck features, so we get the classes of our problem as outputs. The code for this step saves the bottleneck features of the VGG16 and uses an ImageDataGenerator to rescale the images (full code: here).

Train the top model

With the bottleneck features saved, I’m now ready to train the top model. I define a function for that, called ‘train_top_model()’. In it I create a small fully-connected network using the bottleneck features as input. I do not put a randomly initialized classifier directly on top of the pre-trained CNN and train everything at once, because the resulting large gradient updates would destroy the learned weights in the convolutional base. Consequently, we only start fine-tuning once we have a trained top-level classifier.


This model does not perform well yet, because the convolutional weights are still the ImageNet weights. Therefore we have to fine-tune it. Alongside the top-level classifier we fine-tune the last convolutional layers and freeze the first 15 layers. First we have to build the convolutional base of the VGG16 and load the previously trained weights. Then we set the first 15 layers to non-trainable. We also use augmentation for the training images (detailed information in Hauke’s blog post: here). A ‘fit_generator’ helps to feed the data into our RAM in batches. Without it you will get an out-of-memory error.


A note on the usage of a GPU in this context. Training a CNN gets a big speed boost from a GPU or even multiple GPUs. So for this Landmark Recognition Challenge I decided to use an AWS EC2 instance, which already has one Tesla K80 GPU integrated. In the end I got a validation accuracy of nearly 80 percent and a loss of about 0.9, which is a big success. Even if you use a GPU, the training takes several days. It’s nothing new that VGG16 is very demanding because of its depth. Training it on a CPU would be an immeasurably tedious job 😀

Training of the CNN – terminal view


Now we are ready to pass our test images through the network. For each image I get a landmark class from 1 to 15,000 and the related prediction score for this classification. With these I can put together the submission file described in the section ‘The Landmark Recognition Challenge’.

Which landmark?

It is important to note that in this step the question of whether the image shows a landmark at all is ignored. First of all I predict the class label that would be the right solution if there is a landmark. Whether there is a landmark at all is handled in the next prediction step. The landmark label is predicted as follows.

In order to predict the landmark, we need to run the test image through the same pipeline as before. I obtain the predictions of the Convolutional Neural Network. Subsequently I decode the predictions and map them to the landmark labels. And keep in mind to clear the Keras session after each run!
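The decoding step can be illustrated with a small sketch. When images are read from class folders, Keras generators expose a `class_indices` dictionary that maps each folder name to an output index. Because the folder names are sorted as strings, index 1 is not necessarily landmark 1. The helper name `decode_prediction` and the toy values below are assumptions for illustration.

```python
def decode_prediction(probabilities, class_indices):
    """Return (landmark_label, score) for the most likely class.

    probabilities: one score per output neuron for a single image.
    class_indices: mapping folder name -> output index, as provided by a
    directory-based Keras generator.
    """
    # Invert the mapping so that an output index leads back to the folder name.
    index_to_label = {index: label for label, index in class_indices.items()}
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_label[best_index], probabilities[best_index]


# Toy example with three classes; note the string ordering "10" < "2" < "7".
class_indices = {"10": 0, "2": 1, "7": 2}
label, score = decode_prediction([0.1, 0.7, 0.2], class_indices)
print(label, score)
```

In this example the network's output index 1 corresponds to the landmark folder "2", which is why the mapping should be stored together with the trained model.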

Is the prediction correct?

How do you decide for or against the predicted landmark? The answer is called DELF (DEep Local Features). As the name suggests, it is a method to extract local features from images and compare them. In our case DELF helps to match two images containing the same landmark and to obtain local image correspondences. So it is a perfect fit for landmark recognition. DELF was introduced in this paper.

DELF architecture (Source: DELF-paper)

The architecture includes an attention mechanism that is trained to select the features with the highest scores (yellow). The right side of the figure shows the DELF pipeline, which is used to find matches between a query image and database images. The index supports querying by retrieving nearest neighbor (NN) features. Additionally, the image correspondences are based on geometrically verified matches.

For example, the image below visualizes the feature correspondences between two images. The ‘landmark’ in this case is our NovaTec head office in Leinfelden-Echterdingen.

DELF matches – head office NovaTec

Implementation of DELF

It takes three steps to verify a predicted landmark in a test image. You can find them in my repository.

  1. Extract features: DELF extraction from an image list
  2. Find the matches: Comparison of the features to get the matches
  3. Decide the result: Decision on how many correspondences (inliers) are needed for the landmark to count as confirmed

After creating the DELF features, the only thing left to do is find the matches between these features. As mentioned, we do this with geometric verification using RANSAC. Through the returned ‘inliers’, I can measure the DELF correspondences. A test image with more than 35 inliers is considered to contain the predicted landmark.

When putting all together we get a pipeline as shown below.

Full landmark recognition pipeline

We start by classifying the test image into one landmark class. We then check the predicted class with DELF by comparing the test image against images taken from the folder of that landmark class. As soon as the inlier threshold (35) is exceeded or 20 comparisons have run through, a result is returned. The result is either the classified landmark or no landmark.
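The decision rule of this pipeline can be sketched in a few lines. The actual DELF extraction and RANSAC verification are hidden behind a hypothetical `count_inliers` callable, so this is only an illustration of the control flow with the values named above (35 inliers, 20 comparisons), not the code from the repository.

```python
def verify_landmark(query_image, reference_images, count_inliers,
                    inlier_threshold=35, max_comparisons=20):
    """Return True if the query matches one of the reference images.

    count_inliers: hypothetical callable (query, reference) -> number of
    geometrically verified DELF correspondences (RANSAC inliers).
    """
    for reference in reference_images[:max_comparisons]:
        if count_inliers(query_image, reference) > inlier_threshold:
            return True  # enough correspondences: accept the predicted class
    return False  # limit reached without a match: predict no landmark


# Stand-in for DELF + RANSAC that returns made-up inlier counts.
fake_inliers = {"ref_a": 12, "ref_b": 48, "ref_c": 90}
calls = []


def fake_count_inliers(query, reference):
    calls.append(reference)
    return fake_inliers[reference]


is_landmark = verify_landmark("query.jpg", ["ref_a", "ref_b", "ref_c"], fake_count_inliers)
no_landmark = verify_landmark("query.jpg", ["ref_a"], fake_count_inliers)
print(is_landmark, no_landmark, calls)
```

Returning at the first match keeps the number of expensive feature comparisons low: in the example the third reference image is never touched.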

Lessons Learned

During the development of my landmark pipeline and my participation in the Kaggle Landmark Recognition Challenge I learned a few things.

The first point concerns the available hardware. It is important to have enough memory and a powerful processor, because you’re dealing with a lot of data. In my case I only had 50 GB of storage on my virtual machine, so I decided to reduce the resolution of my images. But with a higher resolution the landmark recognizer could work better. One training run takes about three days on one GPU. If you have more GPUs available, you could distribute the training across these GPUs. Your training process would be faster and you could iterate more often to optimize your application.

A Kaggle Challenge takes about three months. For the full three months you should concentrate on optimizing your machine learning result. You will have a big disadvantage if you decide to join a challenge in the middle of its duration. The other participants will always be one step ahead.

Do not focus too much on one model solution. If you try more than one solution, you can decide on the best one or combine multiple techniques. During my participation I focused on just one solution and tried to optimize it. In hindsight I think it would have been better to try other strategies and techniques as well.

Communicate with other Kaggle participants or join a team to share ideas. Besides new ideas, another advantage would be the availability of more computing capacity. You can split the test dataset and deliver the results more quickly.


As you can see, I figured out a lot of new things in my first Kaggle Challenge. I hope these suggestions help you in your challenge, too. Kaggle Challenges are a great way to learn the subject of Machine Learning and to earn a little financial reward if you do well.

An overview of my code is here. If there are any questions, I would like to hear from you. Otherwise, I can only wish you much fun in your next Kaggle Challenge!


“Large-Scale Image Retrieval with Attentive Deep Local Features”, Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, Bohyung Han, Proc. ICCV’17

Christina Cherry