Leaf Classification Project



While learning about AI at TechLabs, we carried out a small project to classify plant species using a Convolutional Neural Network (CNN). We are a team of four students from the fields of business administration, economics and psychology, with little to no coding experience before joining the TechLabs Community. We learned the necessary coding skills for this project through different online courses. You can check out our GitHub, where we provide the full code. [link]

In 2012, Neeraj Kumar et al. developed Leafsnap: A Computer Vision System for Automatic Plant Species Identification, a mobile app that identifies all 185 tree species in the Northeastern United States using pictures of their leaves. The classification process is based on a computer vision system. This system segments the leaf from its background, extracts curvature features of the leaf’s contour and assigns the leaf to one of the 185 tree species. With this procedure, a top-1 score of about 72% is achieved (meaning that in 72% of cases, the species ranked first by the computer vision system is the correct one).

Using the same dataset that the authors of the paper

“Leafsnap: A Computer Vision System for Automatic Plant Species Identification,” Neeraj Kumar, Peter N. Belhumeur, Arijit Biswas, David W. Jacobs, W. John Kress, Ida C. Lopez, João V. B. Soares, Proceedings of the 12th European Conference on Computer Vision (ECCV), October 2012

provide, we asked ourselves whether we could beat the traditional computer vision system’s performance by implementing the classification task with a CNN.

In this post, we give a full overview of the steps we took to train our CNN to classify the Northeastern American tree species from the dataset. Due to limited computational resources on our private hardware, we used Google Colaboratory, Google’s free cloud service for developing deep learning applications on a GPU. Once again, feel free to follow the link to our code for instructions on how to get set up with Google Colaboratory.

Since Convolutional Neural Networks are state-of-the-art for image recognition, we implemented such a network and applied it to our leaf classification problem. To maximize our learning, we did not use a pretrained CNN but built our own network architecture. We also wanted to see how well the CNN performs compared to other classifiers, so we trained several of them and compared the results:

The accuracy scores in the table above are mean values of a 3-fold cross-validation for each algorithm. The scores range from 29% to 84%. Of all these classifiers, the Random Forest algorithm performs best with a score of 84%. This does not, however, exceed the score of our trained CNN, which is around 89%, suggesting that CNNs are indeed better suited to image recognition.

The Architecture of a CNN

The architecture of a regular Neural Network usually consists of an input layer, several hidden layers and an output layer. Convolutional Neural Networks, however, are a special kind of Neural Network. As the name indicates, what sets a CNN apart are its Convolutional Layers, which make CNNs the most commonly applied algorithm for image recognition.

A Convolutional Layer is typically the first layer of a CNN. It consists of several filters, each of which iterates over distinct parts of the input matrix, the so-called receptive fields. The input matrix is just a matrix in which each value refers to one pixel of an image. You can think of this as a “translation” of an image into computer language. The iteration process is called convolution. Combining one filter with all receptive fields of the input matrix yields a new matrix, also called a feature map. Specifically, each value in a feature map is the sum of the elementwise products of the filter values and the values in one distinct receptive field.

During the iteration, the filter starts with the receptive field in the top left corner of the input matrix and then shifts a certain number of pixels to the right. When the right edge of the input matrix is reached, the filter shifts the same number of pixels downwards and iterates from left to right in the next row. The number of pixels shifted while convolving is commonly called the stride, and how the edges of the input matrix are handled is controlled by the padding. Padding, or equivalently zero-padding, means attaching a certain number of rows and columns of zeros around the input matrix. If, for example, the input and filter matrices are of dimensions 6x6 and 2x2 respectively, and stride and padding are 1 and 0, the convolution will yield a 5x5 feature map for each filter.

Example: 6x6 input matrix with a 2x2 filter matrix yields 5x5 feature map
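To make the convolution step concrete, here is a minimal NumPy sketch (not the project's actual code). The function name `convolve2d`, the example input and the example filter are our own illustrative assumptions, and the function assumes square inputs and filters.

```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Slide a square `kernel` over a square `image` and return the feature map."""
    if padding > 0:
        # Zero-padding: attach rows and columns of zeros around the input
        image = np.pad(image, padding, mode="constant", constant_values=0)
    f = kernel.shape[0]
    out = (image.shape[0] - f) // stride + 1
    feature_map = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Receptive field currently covered by the filter
            field = image[i * stride:i * stride + f, j * stride:j * stride + f]
            # Sum of the elementwise products of filter and receptive field
            feature_map[i, j] = np.sum(field * kernel)
    return feature_map

# Illustrative 6x6 input and 2x2 filter (values chosen arbitrarily)
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (5, 5)
```

With stride 1 and no padding, the 2x2 filter fits into five positions per row and per column, which is where the 5x5 feature map comes from.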

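The general formula to compute the dimension of the feature maps is

$$O = \frac{W - F + 2P}{S} + 1$$

where O is the dimension of the feature map, W and F are the dimensions of the input and filter matrix, P is the padding value and S is the stride value. For the example above, this gives (6 - 2 + 2·0)/1 + 1 = 5. The feature maps are then passed on to the next Convolutional Layer as input.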
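The feature map dimension, (W - F + 2P) / S + 1, can also be written as a small helper function. This is only a sketch; the name `conv_output_size` is our own choice for illustration.

```python
def conv_output_size(w, f, p=0, s=1):
    """Side length of a feature map: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span % s != 0:
        # The filter positions do not tile the (padded) input evenly
        raise ValueError("stride does not fit the padded input evenly")
    return span // s + 1

print(conv_output_size(6, 2))        # 5
print(conv_output_size(64, 5))       # 60
print(conv_output_size(6, 2, p=1))   # 7
```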

Most of the time, however, there are Pooling Layers between the Convolutional Layers. They aggregate the results of the preceding Convolutional Layer in order to feed forward only the most meaningful features. A Max-Pooling Layer, for instance, slides a small window over the feature map and passes only the highest value within each window on to the next layer.

Example: Max pooling/ Credit: http://cs231n.github.io/convolutional-networks/
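Max pooling can be sketched in a few lines of NumPy. The function name `max_pool2d` and the 4x4 example values are assumptions for illustration; the function assumes a square feature map.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum of each `size` x `size` window."""
    out = (feature_map.shape[0] - size) // stride + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

# Illustrative 4x4 feature map
feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]], dtype=float)
pooled = max_pool2d(feature_map)
print(pooled)  # [[6. 8.] [3. 4.]]
```

A 2x2 window with a stride of 2 halves the height and width, so a 4x4 feature map becomes 2x2.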

The last layer of a CNN is a fully connected layer, as is common for regular Neural Networks. The fully connected layer follows the last Pooling Layer, and each of its neurons is connected to every value of its input.

Architecture of a Convolutional Neural Network/ Credit: https://de.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html

The dataset

The dataset is freely available at http://leafsnap.com/dataset/. It covers all 185 tree species from the Northeastern United States. There is a total of 30866 images, of which 23147 are lab images and 7719 are field images. The lab images are of high quality and come in controlled backlit and front-lit versions, with several samples per species. Since the field images are taken with mobile devices, their quality is generally worse than that of the lab images.

We generate a histogram to see how the number of images for each species is distributed in the Leafsnap dataset.

One can see that most tree species have approximately 100–150 images to work with. The species with the fewest images has around 50, while the species with the most images has around 350.
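The counts behind such a histogram can be computed with the standard library alone. The sketch below uses made-up placeholder labels rather than the real Leafsnap metadata, and the function name `histogram_of_counts` and the bin width of 50 are our own assumptions.

```python
from collections import Counter

def histogram_of_counts(labels, bin_width=50):
    """Count images per species, then count species per bin of image counts."""
    per_species = Counter(labels)
    bins = Counter((n // bin_width) * bin_width for n in per_species.values())
    return per_species, dict(sorted(bins.items()))

# Placeholder labels, one entry per image (NOT the real dataset)
labels = (["species_a"] * 120 + ["species_b"] * 60 +
          ["species_c"] * 130 + ["species_d"] * 340)
per_species, bins = histogram_of_counts(labels)
print(bins)  # {50: 1, 100: 2, 300: 1}
```

Each key in `bins` is the lower edge of a bin, and each value is the number of species whose image count falls into that bin.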

Preprocessing the data

In order to convert the images into readable input for our CNN, we have to preprocess the images. This includes the following steps:

1. Read the data frame with information about the pictures. It has 30866 rows, one for every image, and five columns: a unique id for every image, the file paths to the normal and the segmented image, the species name, and whether the image was taken in the lab or in the field.

2. Create numeric labels

3. Resize Pictures

4. Read pictures as RGB arrays

5. Randomize picture order

6. Stack the pictures into one input array

7. Normalize input features (pictures) and one-hot encode labels

8. Save input features, labels and further info
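Some of these steps can be sketched with NumPy as follows. This is an illustration under assumptions, not the project's code: the function name `preprocess` is hypothetical, the images are assumed to be already read and resized (steps 1, 3, 4 and 8 are left out), and stacking happens before shuffling, which gives the same result as shuffling first.

```python
import numpy as np

def preprocess(images, species, seed=0):
    """images: list of equally sized uint8 RGB arrays; species: list of names."""
    # Step 2: map each species name to a numeric label
    classes = sorted(set(species))
    class_to_index = {name: i for i, name in enumerate(classes)}
    labels = np.array([class_to_index[name] for name in species])

    # Step 6: stack all pictures into one array of shape (m, height, width, 3)
    x = np.stack(images)

    # Step 5: randomize the picture order, keeping labels aligned
    order = np.random.default_rng(seed).permutation(len(x))
    x, labels = x[order], labels[order]

    # Step 7: scale pixel values to [0, 1] and one-hot encode the labels
    x = x.astype("float32") / 255.0
    y = np.eye(len(classes), dtype="float32")[labels]
    return x, y, classes

# Three synthetic 64x64 "pictures" stand in for real leaf images
images = [np.full((64, 64, 3), v, dtype=np.uint8) for v in (0, 128, 255)]
species = ["ulmus glabra", "abies nordmanniana", "ulmus glabra"]
x, y, classes = preprocess(images, species)
print(x.shape, y.shape)  # (3, 64, 64, 3) (3, 2)
```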

Split the data for training

We split our picture arrays (and the corresponding labels) into three parts:

  1. train-set: The data we train the model on (here 80%)

  2. development-set (dev-set): The data we use to evaluate the model’s generalization performance (to new, unseen data) during training (here 10%)

  3. test-set: The data we want to predict using our trained model (here 10%)

  • For further evaluation of the results, we set aside the labels (Latin names), numeric labels, filenames and source (lab/field) of the test-set
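An 80/10/10 split like this can be sketched with plain array slicing. The function name `split_80_10_10` is hypothetical, and the sketch assumes the arrays have already been shuffled.

```python
import numpy as np

def split_80_10_10(x, y):
    """Split already-shuffled arrays into train, dev and test parts."""
    m = len(x)
    n_train = m * 8 // 10
    n_dev = m // 10
    train = (x[:n_train], y[:n_train])
    dev = (x[n_train:n_train + n_dev], y[n_train:n_train + n_dev])
    test = (x[n_train + n_dev:], y[n_train + n_dev:])  # remainder
    return train, dev, test

# Synthetic stand-in data: 100 examples with 4 features each
x = np.arange(400).reshape(100, 4)
y = np.arange(100)
train, dev, test = split_80_10_10(x, y)
print(len(train[0]), len(dev[0]), len(test[0]))  # 80 10 10
```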

Model architecture

The model takes as input batches of pictures of shape (m, 64, 64, 3), that is m colored 64x64 images with 3 RGB channels.

In the first convolutional layer, 32 5x5 filters are applied with a stride of 1 (so each filter shifts over the picture in steps of 1 pixel) and no padding. Therefore, the output of this convolutional layer is (m, 60, 60, 32). This follows from the formula for the feature map dimension:

$$\frac{W - F + 2P}{S} + 1 = \frac{64 - 5 + 2 \cdot 0}{1} + 1 = 60$$

This is fed to a max pooling layer in which a (2,2) window, selecting each maximum, slides over the input with a stride of 2, collapsing the input to an output of (m, 30, 30, 32).

The next convolutional layer consists of 64 5x5 filters, again with a stride of 1. This generates an output of (m, 26, 26, 64), which the next pooling layer (same specifications as the first one) collapses to a size of (m, 13, 13, 64).

The output of this pooling layer is then flattened into a matrix of (10816, m) (13x13x64 = 10816) and fed to a fully connected layer with 1000 nodes, which in turn outputs (1000, m) to the final SoftMax layer with 185 nodes (185 classes), giving (185, m) as the final output.
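The shapes above can be double-checked with a few lines of Python. This sketch only traces the dimensions through the layers described in this section; it does not build or train a network, and the variable names are our own.

```python
# (name, window size, stride, number of filters); pooling keeps the channel count
layers = [
    ("conv1", 5, 1, 32),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 64),
    ("pool2", 2, 2, None),
]

side, channels = 64, 3
shapes = {"input": (side, side, channels)}
for name, window, stride, filters in layers:
    # No padding anywhere, so the formula reduces to (W - F) / S + 1
    side = (side - window) // stride + 1
    if filters is not None:
        channels = filters
    shapes[name] = (side, side, channels)

flattened = side * side * channels
for name, shape in shapes.items():
    print(name, shape)
print("flattened:", flattened)  # flattened: 10816
```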

A Rectified Linear Unit (ReLU) activation function is used throughout the network (except for the final SoftMax layer). It is defined as f(x) = max(0,x), so it assigns zero to negative values and keeps positive values unchanged. In this way, the ReLU function introduces non-linearity into the network.
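Both activation functions are short enough to write out in NumPy. This is a sketch for illustration; here each row is treated as one example, and the softmax subtracts the row maximum first, a common trick for numerical stability that does not change the result.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

activated = relu(np.array([-2.0, 0.0, 3.0]))
print(activated)  # [0. 0. 3.]

scores = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
probs = softmax(scores)
print(probs.sum(axis=-1))
```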

Dropout was implemented to reduce overfitting. In dropout, activations are randomly set to zero with a certain probability (here 0.7). This prevents the model from relying only on a few high activations that result from replicating noise in the training data, instead of finding a meaningful input-to-output mapping.

The model was trained with the Adam optimizer as its gradient descent optimization algorithm. The learning rate was set to 0.0001 with a decay of 1e-8 (to prevent overshooting the minimum). Categorical cross-entropy was used as the loss function.
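The idea behind dropout can be sketched as follows. We assume here that 0.7 is the probability of dropping an activation, and that the surviving activations are rescaled ("inverted dropout"), as most libraries do; neither detail is confirmed by the description above, and the function name `dropout` is our own.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Zero each activation with probability `drop_prob` (training time only)."""
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Rescale survivors so the expected activation stays the same (assumption)
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, drop_prob=0.7, rng=rng)
zero_fraction = float((dropped == 0).mean())
print(round(zero_fraction, 2))
```

At prediction time, dropout is switched off and all activations are used.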
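For one-hot encoded labels, the categorical cross-entropy is the average negative log-probability that the model assigns to the correct class. A minimal NumPy sketch, with a hypothetical function name and a small `eps` to avoid taking the logarithm of zero:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over examples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
confident = np.array([[0.99, 0.01], [0.01, 0.99]])

loss_uncertain = categorical_cross_entropy(y_true, uncertain)
loss_confident = categorical_cross_entropy(y_true, confident)
print(loss_uncertain > loss_confident)  # True
```

The loss shrinks as the predicted probability of the correct class approaches 1, which is what the optimizer pushes towards during training.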


Our training lasted 400 epochs. Accuracy rises rapidly up to the 40th epoch. After that, only a slow increase in accuracy can be observed.

The loss behaves similarly to the accuracy. It drops sharply up to the 40th epoch. After that, it decreases only in very small steps.

In both plots we can see that the performance on training and test data does not differ much. This confirms to us that the regularizers “Dropout” and “Data Augmentation” saved our model from overfitting. A run without regularizers showed clear overfitting.


With our best model we achieve an accuracy on the test set of about 89%.

However, we have to take a closer look at this result because it is an average. To judge how well our model really performs, we need to know its weaknesses. Therefore, we have to ask ourselves:

1. Does it make a difference whether the images come from the lab or the field?

2. Which classes can we classify particularly well with our model and which are particularly difficult to classify?

Lab vs. field

To answer this question, we calculated and compared the share of misclassified lab pictures and the share of misclassified field pictures.

Error rate for lab pictures: 0.0580 (about 5.8%)

Error rate for field pictures: 0.2571 (about 25.7%)

Thus we know that the pictures from the field are recognized much less reliably than those taken in the lab.
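Such a comparison boils down to averaging the misclassifications within each group. The sketch below uses made-up toy predictions, not our real test results, and the function name `error_rate_by_source` is hypothetical.

```python
import numpy as np

def error_rate_by_source(y_true, y_pred, source):
    """Share of misclassified pictures, separately for each source."""
    y_true, y_pred, source = map(np.asarray, (y_true, y_pred, source))
    wrong = y_true != y_pred
    return {str(s): float(wrong[source == s].mean()) for s in np.unique(source)}

# Toy data: numeric labels, predictions and the source of each picture
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 0, 0, 2]
source = ["lab", "lab", "lab", "lab", "field", "field"]
rates = error_rate_by_source(y_true, y_pred, source)
print(rates)  # {'field': 0.5, 'lab': 0.25}
```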

“Easy” vs. “difficult” classes

To identify the classes on which the model performed well and those that were often misclassified, we had a look at the species with no errors and the species with the highest error rate (10 %).

Example for (one of the species with the) lowest error rate (abies nordmanniana):

Example for species with the highest error rate (ulmus glabra):
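Ranking the classes from easiest to hardest only needs the error rate per true class. Again, this is a sketch with toy labels rather than our real predictions, and the function name `rank_classes_by_error` is an assumption.

```python
from collections import defaultdict

def rank_classes_by_error(y_true, y_pred):
    """Return (class, error rate) pairs sorted from easiest to hardest."""
    totals, errors = defaultdict(int), defaultdict(int)
    for true_label, predicted in zip(y_true, y_pred):
        totals[true_label] += 1
        if predicted != true_label:
            errors[true_label] += 1
    rates = {c: errors[c] / totals[c] for c in totals}
    return sorted(rates.items(), key=lambda item: item[1])

# Toy labels for three classes
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "c", "a", "b"]
ranking = rank_classes_by_error(y_true, y_pred)
print(ranking)  # [('a', 0.0), ('b', 0.5), ('c', 1.0)]
```

The first entries of the ranking are the classes without errors, and the last entries are the candidates for a closer look.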

Now we know which classes are really easy to classify and which are really hard to classify. At this point, however, we don’t understand why. It may help to know how the model makes its decisions. Therefore, we generate heatmaps, which show the feature map regions that cause the most change in the output.

Heatmap for the species yielding the lowest error rate (abies nordmanniana):

Heatmap for the species yielding the highest error rate (ulmus glabra):

Examples of lab picture heatmaps:

Examples of field picture heatmaps:

We can see that for the lab pictures, the CNN often relies not on the leaf itself but on objects lying next to the leaf, such as a ruler. This shows that the model is able to detect patterns, but not to distinguish between those that are relevant for the task and those that are not. We therefore need to clean the input more!

How the leaves are really identified and distinguished can be observed better for the field images.


With our simple architecture we were able to improve on the performance of the computer vision algorithm. Specifically, top-1 accuracy rose from ca. 72% (computer vision algorithm) to 89% (our model), training on an average of 133 examples for each of the 185 classes of tree species in Northeastern America. This emphasizes how promising CNNs are for image classification.

However, it seems that for tree leaf pictures taken in the lab, the color bars on the images drew our model’s attention, as they seem to have been used to classify these pictures (see the heatmaps). To rely only on the leaves themselves, future work should train a CNN only on pictures that show nothing but the leaf against a background. This is a clear limitation of this project, especially in light of the differences in error rates between lab and field pictures.

What is more, a dataset that is bigger and better balanced across the different species could be used. It may also be reasonable to use pictures with a higher resolution than 64x64 pixels as input to the CNN to further boost performance. In this project, even with Google Colaboratory, computational resources were not sufficient to do so. Performance for tree species with relatively small leaves in particular could profit from this approach.

After finishing the project, we reflected on what we have learned. We concluded that coding looks more difficult and complicated than it is. It is possible to learn basic coding in less than 50 hours. Another point is that you can find helpful packages for almost everything in Python. You don’t need to understand all the mathematics behind artificial intelligence, but it is very helpful to understand the mathematical background in general. We also found that it is valuable to have complete and clean data, because even the preprocessing can be very time-consuming.

Finally, we want to thank the management team of TechLabs for their great support during the past half year!

If you have any questions, don’t hesitate to contact us:

Mathis Erichsen mathis.erichsen@uni-muenster.de

Chris Vennemann c_venn01@uni-muenster.de

Jens Burtscheidt https://www.linkedin.com/in/jens-burtscheidt/

Jennifer Hölling Jennifer_hoelling@web.de