I found my way into machine learning via traditional linear models. Most of the time, I applied these techniques to real-world problems in marketing automation, customer intelligence, pricing, procurement optimization, and forensic data analysis (fraud). But our world is not always linear, so I extended my knowledge to non-linear classifiers and function approximators such as decision trees, random forests, and k-nearest neighbors.

Most commercial data sets I faced during my time at PwC, BCG, and Vistaprint were not labelled. That is, no target values were available. In these settings, I applied unsupervised techniques to identify patterns in the data: hierarchical clustering for small data sets, k-means clustering for larger data sets, and self-organizing maps for more complex data sets.

In my spare time I looked into black-box models such as artificial neural networks. Black-box models are not very popular in commercial settings as they are hard to interpret and hence also hard to "sell" to executives and decision makers. I applied these models mainly to computer vision tasks or problems that involve natural language.

Extreme learning machine (ELM)

Artificial neural networks are usually trained with the traditional backpropagation procedure. One of its downsides is the long training time. I investigated a technique called ELM, which trains artificial neural networks by solving a linear system with the Moore-Penrose pseudoinverse. There is quite a controversy (link #1, link #2) around ELM, and I personally do not like the name, but that did not prevent me from looking into this alternative. I found this way of training artificial neural networks especially useful in settings where a model has to be retrained very often within short amounts of time. I published a paper about ELM, which is available on Springer.
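To make the idea concrete, here is a minimal sketch of ELM-style training. It is not the code from my paper. It assumes a single hidden layer whose random weights stay fixed, a tanh activation, and a toy regression problem; the function names and the toy data are made up for illustration. Only the output weights are learned, in one step, via the pseudoinverse.

```python
import numpy as np

def train_elm(X, y, n_hidden=50, seed=0):
    """Fit only the output weights; the hidden layer is random and never trained."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input-to-hidden weights
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # least-squares output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Apply the fixed random hidden layer, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy problem (assumption): approximate sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

W, b, beta = train_elm(X, y)
mse = np.mean((predict_elm(X, W, b, beta) - y) ** 2)
print(f"training MSE: {mse:.6f}")
```

Because there is no iterative weight update, retraining amounts to one matrix factorization, which is what makes the approach attractive when models have to be refreshed frequently.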

Deep learning

I try to stay up to date with current research and advancements in my area, and as deep learning is a big thing these days, I decided to look critically into it as part of my final MSc project. In a nutshell, deep learning is about learning multiple levels of abstraction and representation. It is a continuation of representation learning. The main ingredients are: artificial neural networks with many hidden layers, large amounts of labelled training data, very flexible models, and a lot of computing power.

Deep learning was able to boost the state-of-the-art results in computer vision and speech recognition. I was especially interested in whether deep learning could lead to similar performance boosts in natural language processing and text mining. I conducted extensive experiments on a large-scale text classification problem. While deep learning architectures could beat existing benchmark results, the improvement was not comparable to the boosts seen in computer vision and speech recognition. More details can be found in my thesis. Please get in touch with me via the contact page to obtain a PDF copy.


I am especially curious about meta-learning approaches such as curriculum learning, transfer learning, and multitask learning.

Curriculum learning

Curriculum learning is a learning strategy that presents training examples in a meaningful order, gradually increasing the difficulty of the concepts to be learned. That way, previously learned concepts can be exploited, which is not possible with a random order.

Learning advanced calculus before having understood basic mathematical operations does not make much sense. An average person spends roughly 20 years in a structured learning process. The order in which knowledge is acquired matters, and a meaningful order should lead to better learning results than a random order. Yoshua Bengio et al. showed on a variety of tasks that this concept, named curriculum learning, also applies in the context of machine learning.

They ran two toy experiments and one language modelling experiment to investigate the research question. One of the toy experiments is on shape recognition: geometric shapes must be classified into three classes, namely rectangle, ellipse, and triangle. Two different data sets were generated, BasicShapes and GeomShapes. The former is limited in the variability of its shapes, which makes it the less complex data set. The lowest misclassification error was achieved when the classifier was first trained on the less complex BasicShapes data set and then on the more complex GeomShapes data set.

The language modelling experiment was performed with a deep neural network that predicts the score of the next word given the previous words. The model was trained on 631 million windows of size 5, based on an English Wikipedia data dump. The curriculum strategy for this experiment consists of starting to train the model with the 5,000 most frequent words and then increasing the vocabulary size by 5,000 in each subsequent training pass. This approach was compared against a no-curriculum strategy in which the model was trained on the full vocabulary from the start. The curriculum strategy was observed to outperform the no-curriculum strategy shortly after the vocabulary size reached 20,000 words, at approx. 1 billion updates.
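The vocabulary schedule described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function names are hypothetical, words are assumed to be represented by their frequency rank (0 = most frequent), and the rule that a window is used only if all of its words are inside the current vocabulary is my assumption about how the growing vocabulary restricts the training data.

```python
def curriculum_vocab_sizes(full_vocab_size, step=5000):
    """Yield the vocabulary size used in each successive training pass."""
    size = step
    while size < full_vocab_size:
        yield size
        size += step
    yield full_vocab_size  # final pass uses the full vocabulary

def window_in_vocab(window_ranks, vocab_size):
    """Assumed filter: keep a window only if every word rank is in the vocabulary."""
    return max(window_ranks) < vocab_size

sizes = list(curriculum_vocab_sizes(20000))
print(sizes)  # [5000, 10000, 15000, 20000]

# A window of size 5, given as frequency ranks of its words (made-up values).
window = [3, 120, 4999, 17, 250]
print(window_in_vocab(window, 5000))  # True: usable in the first pass
```

The no-curriculum baseline corresponds to a single stage that starts directly at the full vocabulary size.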

Multitask learning

The common approach in supervised machine learning is to split a complex problem into easier tasks. These tasks are often binary classification problems, or classification problems with multiple, mutually exclusive classes. But this approach ignores substantial information that is available when the classes are not mutually exclusive. A simplified example should help to illustrate the two approaches. An insurance company could be interested in cross-selling life insurance to its customers and also in detecting whether good customers are at risk of churn. The traditional approach would be to create two separate models, one for the cross-selling of life insurance and one for the churn prediction. This approach is referred to as singletask learning, as each classifier solves exactly one specific task. Both tasks could, however, be solved in parallel. A softmax classifier would not be suited here, because the two classes are not mutually exclusive: a customer could have a real demand for life insurance while also being at risk of churn. Multitask learning is about solving such mutually non-exclusive classification problems jointly.
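The following small sketch shows why a softmax output does not fit this situation. The two logits are made-up scores for "wants life insurance" and "is at risk of churn"; nothing here comes from a real model. A softmax forces the two probabilities to sum to one, whereas independent sigmoid outputs allow both to be high at the same time.

```python
import math

def softmax(logits):
    """Probabilities over mutually exclusive classes (they sum to 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for a single yes/no question."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits: [wants life insurance, at risk of churn]
logits = [2.0, 1.5]

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(z) for z in logits]

print("softmax:", [round(p, 3) for p in softmax_probs])
print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
```

With the softmax, a high probability for one label necessarily pushes the other one down; with the sigmoids, both labels can exceed 0.5 for the same customer.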

Rich Caruana compared singletask and multitask learning on three different problems with multitask learning achieving consistently better results than singletask learning. He argued that multitask learning performs an inductive transfer between different tasks that are related to each other. The different training signals represent an inductive bias which helps the classifier to prefer hypotheses that explain more than just one task.
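One common way to realize this kind of inductive transfer is a network with a shared hidden layer and one output per task. The sketch below is my own illustration under that assumption, not Caruana's original setup: the data is random, the layer sizes are arbitrary, and only the forward pass and the joint loss are shown. Because the joint loss sums both task losses, the shared weights would receive training signals from both tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, t):
    """Mean binary cross-entropy between predicted probabilities p and targets t."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(t * np.log(p) + (1 - t) * np.log(1 - p)))

# Hypothetical toy data: 8 customers, 4 features, two binary labels each.
X = rng.normal(size=(8, 4))
y_cross_sell = rng.integers(0, 2, size=8)
y_churn = rng.integers(0, 2, size=8)

# Shared hidden layer plus one task-specific output head per task.
W_shared = rng.normal(scale=0.1, size=(4, 6))
w_cross_sell = rng.normal(scale=0.1, size=6)
w_churn = rng.normal(scale=0.1, size=6)

hidden = np.tanh(X @ W_shared)                 # representation used by both tasks
p_cross_sell = sigmoid(hidden @ w_cross_sell)  # head 1: cross-selling
p_churn = sigmoid(hidden @ w_churn)            # head 2: churn

# Joint objective: both tasks contribute to the gradient of W_shared.
joint_loss = (binary_cross_entropy(p_cross_sell, y_cross_sell)
              + binary_cross_entropy(p_churn, y_churn))
print(f"joint loss: {joint_loss:.4f}")
```

Singletask learning would correspond to two such networks with separate hidden layers, each trained on only one of the two losses.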

Transfer learning

Transfer learning generalizes the idea of multitask learning. A major bottleneck in machine learning is very often the lack of good-quality data sets, that is, annotated training examples. Transfer learning offers a pragmatic solution to this problem by leveraging data from related problems. More formally, transfer learning is defined as the process of extracting knowledge from auxiliary domains in order to boost the performance in a specific target domain.

In natural language processing, high-quality corpora are scarce for most languages other than English or Spanish. I made use of transfer learning for a text classification problem and was able to boost the classification accuracy for low-resource languages such as Danish, Dutch, Swedish, and Norwegian by leveraging English resources.

Distributed representations

Humans are able to compare two objects on different levels. A dog and a cat are not the same. However, they are both living animals and they usually walk on four feet. We humans understand these different levels of similarity. Computers, however, fail badly at this kind of comparison. Dense representations for words or images make it possible to measure such fine-grained similarities across concepts. One could, for example, measure the similarity of words represented as vectors with the cosine and see that the vectors of dog and cat are more similar to each other than those of dog and chair.
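As a small illustration, the cosine similarity between two vectors is their dot product divided by the product of their lengths. The four-dimensional vectors below are invented for this example and do not come from a trained model; real word embeddings typically have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up dense vectors (assumption, for illustration only).
embeddings = {
    "dog":   [0.9, 0.8, 0.1, 0.0],
    "cat":   [0.8, 0.9, 0.2, 0.0],
    "chair": [0.0, 0.1, 0.9, 0.8],
}

sim_dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
sim_dog_chair = cosine_similarity(embeddings["dog"], embeddings["chair"])

print(f"dog vs cat:   {sim_dog_cat:.3f}")
print(f"dog vs chair: {sim_dog_chair:.3f}")
```

With these toy vectors, dog and cat end up close to 1, while dog and chair stay close to 0.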

Bayesian statistics

One of the next topics I want to explore further is Bayesian statistics applied to machine learning. Two application areas are of special interest to me: hyper-parameter tuning and data-efficient learning.

Hyper-parameter optimization

Many (complex) models, such as artificial neural networks, require a lot of decisions to be made at the hyper-parameter level. Some of these are: learning rate, weight decay, dropout rate, batch size, number of epochs, etc. These hyper-parameters have a huge impact on the overall model performance. If you are not a brilliant genius with the incredible ability to identify an optimal hyper-parameter setting by pure gut feeling, you will need to explore many different configurations. As training a neural network can take a long time, grid-search approaches are not always the best recommendation. What is more, a simple random search can even beat grid search. Bayesian statistics can help here. Bayesian optimization, typically with a Gaussian process as a model of the performance surface, makes it possible to find a good tradeoff between exploration and exploitation of the large hyper-parameter space.
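The following is a rough sketch of Gaussian-process-based hyper-parameter search for a single hyper-parameter scaled to [0, 1]. Everything in it is an assumption made for illustration: the squared-exponential kernel and its length scale, the upper-confidence-bound rule with kappa = 2, and in particular validation_score, which is a cheap stand-in for an expensive training run. In practice one would rather use an established library than this hand-rolled version.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a GP fitted to the observations."""
    y_mean, y_scale = y_obs.mean(), y_obs.std() + 1e-12
    y_norm = (y_obs - y_mean) / y_scale            # standardize the targets
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    mean = k_oq.T @ np.linalg.solve(k_oo, y_norm)
    var = 1.0 - np.sum(k_oq * np.linalg.solve(k_oo, k_oq), axis=0)
    std = np.sqrt(np.clip(var, 0.0, None))
    return y_mean + y_scale * mean, y_scale * std

def validation_score(x):
    """Hypothetical stand-in for training a model; best value at x = 0.3."""
    return -(x - 0.3) ** 2

def bayesian_optimize(n_init=3, n_iter=10, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)
    x_obs = rng.uniform(0.0, 1.0, size=n_init)     # a few random starting points
    y_obs = validation_score(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        ucb = mean + kappa * std                   # exploitation + exploration
        x_next = candidates[np.argmax(ucb)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, validation_score(x_next))
    best = np.argmax(y_obs)
    return x_obs[best], y_obs[best], len(x_obs)

best_x, best_y, n_evals = bayesian_optimize()
print(f"best setting {best_x:.3f} with score {best_y:.5f} after {n_evals} evaluations")
```

The mean term favours settings that already look good (exploitation), while the standard-deviation term favours settings the model is still unsure about (exploration); kappa controls the balance between the two.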

Ethics and machine learning

As machine learning techniques have more and more influence on our daily (offline) lives, we need to think about the ethics of these new applications. A bias in the training data (such as a racial, gender, or religious bias) results in a similar bias in the model. For simple linear models, it is quite easy to detect the relationship between inputs and outputs. For black-box models, however, this is not easily possible. I am very interested in ways to identify input-output relationships for black-box models.

If any of these topics sounds interesting to you and you would like to collaborate on a research project, please reach out via the contact page!