Everything is automatic and effortless with machine learning algorithms, huh? Big data is all you need, after all. You have a dataset, you split it when necessary, you pick a machine learning model, you train it, and the miracle of a correct classification or prediction shines on you, your name, your business. Artificial intelligence is easy, isn’t it? No, it is not: this is only advertising.

“Most of the Machine Learning talks present beautiful cases of success, but in reality models often fail to deliver the desired performance”, states Rafael Garcia-Dias in the introduction to his talk at Codemotion Milan 2019. “It is not uncommon to see developers blaming certain models and even blacklisting certain models.”

Rafael is a research associate at King’s College London, where his main focus is developing machine learning models to diagnose patients based on structural MRI. He has found that it often takes many rounds of trial and error to find a good combination of data and algorithm, if one exists.

And data are nothing without a firm grasp of the problem you are facing. “Only when you know them you can think about your model”, summarizes Rafael. “Be sure you understand your problem”: if you don’t have enough data, then generate more, even if this proves expensive.

A good path from astrophysics to neuroscience

Automated learning can help in branches of knowledge that are fascinating but inaccessible to the unaided human mind. Garcia-Dias’s career offers some striking examples. He spent time studying the chemical history of galaxies. “With machine learning tools you can understand where the interstellar gas of each of them comes from”. But the constraints of your data limit your performance: “not all clusters are distinguishable with today’s approaches”.

The second example of Rafael’s work is the analysis of MRI scans. “We determine the brain age, then we compare it with the real age of the person”, explains the King’s College researcher; “the results can help diagnose some important diseases in time”.

Linearity is broken

A common mistake researchers make is to expect the process to be linear. First of all, each model has its own limitations, which the coder must know to be sure it can deliver the desired results.

One example from Rafael’s experience is based on the k-means algorithm. You need to understand the underlying assumptions of your model. K-means measures distance with the Euclidean metric, and this constrains where the model can be used. It rarely works as is and often needs a lot of work on data and parameters. Moreover, there are many viable alternatives, such as GMM and DBSCAN.
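To make the Euclidean constraint concrete, here is a minimal sketch on synthetic data. It is not taken from Rafael's talk: the two features, their scales and the use of `StandardScaler` are all assumptions chosen for illustration. One feature separates two groups, the other is pure noise on a much larger scale, and k-means is run before and after rescaling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200  # points per group (arbitrary choice for this sketch)

# Feature 0 separates the two groups: means 0 and 1, tiny spread.
informative = np.concatenate([rng.normal(0.0, 0.1, n), rng.normal(1.0, 0.1, n)])
# Feature 1 is pure noise on a much larger scale.
noise = rng.normal(0.0, 100.0, 2 * n)

X = np.column_stack([informative, noise])
y_true = np.repeat([0, 1], n)


def kmeans_ari(data):
    """Run k-means with k=2 and compare the result with the known groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(y_true, labels)


ari_raw = kmeans_ari(X)                                     # noise dominates distances
ari_scaled = kmeans_ari(StandardScaler().fit_transform(X))  # both features comparable

print(f"Adjusted Rand index, raw features:    {ari_raw:.2f}")
print(f"Adjusted Rand index, scaled features: {ari_scaled:.2f}")
```

On the raw features the large-scale noise dominates every Euclidean distance, so k-means tends to split the data along the wrong axis. Rescaling is only one possible piece of the "work on data" mentioned above, and it is not a universal fix.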

Gaussian Mixture Modelling (GMM) can be seen as a generalization of k-means: it assumes that all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The scikit-learn library lets you use GMM with several alternative strategies.
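As a rough sketch of what this looks like in scikit-learn, the snippet below fits `GaussianMixture` with each of its four `covariance_type` options and compares them by BIC. Reading the "alternative strategies" as covariance types is an assumption, and the synthetic blobs and the choice of three components are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: three groups with different spreads (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.0],
                  random_state=0)

# Fit one mixture per covariance structure and record its BIC (lower is better).
bic_by_type = {}
for cov_type in ("spherical", "diag", "tied", "full"):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          random_state=0).fit(X)
    bic_by_type[cov_type] = gmm.bic(X)

best_type = min(bic_by_type, key=bic_by_type.get)
print("BIC per covariance type:", bic_by_type)
print("Lowest BIC:", best_type)

# Unlike k-means, a GMM gives soft assignments: one probability per component.
best_gmm = GaussianMixture(n_components=3, covariance_type=best_type,
                           random_state=0).fit(X)
proba = best_gmm.predict_proba(X)
print("Membership probabilities of the first point:", np.round(proba[0], 3))
```

The soft memberships returned by `predict_proba` are one practical difference from k-means, which only gives a hard label per point.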

Density-based spatial clustering of applications with noise (DBSCAN) groups together points that have many nearby neighbors and marks points in low-density regions as noise.
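A small sketch of that behaviour, using scikit-learn's `DBSCAN` on made-up points: two dense groups plus three isolated points. The coordinates and the values of `eps` and `min_samples` are assumptions picked to make the effect easy to see.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense groups of 100 points each, far apart from one another.
dense_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(100, 2))
dense_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(100, 2))
# Three isolated points with no close neighbors at all.
isolated = np.array([[2.5, 2.5], [10.0, -10.0], [-10.0, 10.0]])

X = np.vstack([dense_a, dense_b, isolated])

# eps is the neighborhood radius; min_samples is how many points must fall
# inside that radius for a point to count as part of a dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})  # -1 is the label DBSCAN gives to noise
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("points marked as noise:", n_noise)
```

Note that the number of clusters is never passed in: it falls out of `eps` and `min_samples`, which is exactly why those two parameters deserve attention.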

Garcia-Dias tested these three algorithms with different parameters, showing that very small changes can strongly alter the homogeneity score. You can limit the number of trials if you have a good feel for your data.
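A comparison of this kind could be sketched as follows. This is not Rafael's actual experiment: the synthetic blobs, the parameter grids and the model settings are all assumptions, and only the idea of scoring each run with scikit-learn's `homogeneity_score` comes from the talk.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score
from sklearn.mixture import GaussianMixture

# Synthetic data with known group labels (illustrative only).
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=1.2, random_state=0)

scores = {}

# k-means and GMM: vary the number of clusters / components.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[("k-means", f"n_clusters={k}")] = homogeneity_score(y_true, labels)

    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    scores[("GMM", f"n_components={k}")] = homogeneity_score(y_true, labels)

# DBSCAN: vary the neighborhood radius. Noise points (label -1) are scored
# here as if they formed one extra cluster.
for eps in (0.3, 0.5, 0.7, 0.9):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    scores[("DBSCAN", f"eps={eps}")] = homogeneity_score(y_true, labels)

for (model, setting), score in scores.items():
    print(f"{model:8s} {setting:16s} homogeneity = {score:.2f}")
```

One caveat when reading the output: homogeneity only checks that each cluster contains members of a single true group, so it tends to reward splitting the data into more clusters. Pairing it with `completeness_score` or `v_measure_score` gives a more balanced picture.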

Deciding which model suits you best: real-world apps

Many different algorithms are available, implemented in many libraries. To make good use of any of them, you have to know what lies behind the code. This great variety of development tools can create a problem of choice. Coding for ML can look strange, but it’s more or less like any other kind of programming: if you know one environment, you can learn any other.

Existing libraries may not look good enough for a particular goal, so a researcher might think of writing their own code. Is this normally a mistake?

“I never write new libraries myself”, answers Rafael, “because that code is highly optimized and strongly reviewed. But I often look for other libraries in different languages”. Python’s libraries are often surpassed by their R equivalents, to name one example.

“Great programmers develop great libraries, and algorithms, all stuff that flows in the open-source software pool, sooner or later”. Each of them has its own limitations, which you need to study and know to make the best choice. It’s better to spend your time on simple baselines such as dummy classifiers and dummy regressors!
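For readers who have not met them, here is a minimal sketch of a dummy baseline with scikit-learn's `DummyClassifier`. The imbalanced synthetic dataset and the logistic regression used as the "real" model are assumptions made for the example, not something shown in the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced synthetic problem: roughly 90% of samples belong to class 0.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# The dummy model ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# A real model, to be judged against the baseline rather than against zero.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print(f"dummy baseline accuracy:      {baseline_acc:.3f}")
print(f"logistic regression accuracy: {model_acc:.3f}")
```

On imbalanced data the dummy model already scores high on accuracy, so a trained model is only worth something to the extent that it beats this number. `DummyRegressor` plays the same role for regression problems.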

Conclusions

To be crystal clear: bad models don’t exist. But some silly, limiting mistakes must be avoided: you have to be aware of all the assumptions behind each model, and you need a real feel for your data.

The most important piece of advice is to keep the process going: “never quit thinking”, which looks well suited to both AI algorithms and real-life activities.