May 25, 2019

ML Bothers Me, Too.

Friday night I received a passionate email from my student:

Hi Tanya! I learned from you that it's so important to make sure that features make logical sense, in order to avoid statistical mistakes (sorry I forgot the term you used about the coincidences).
But in that case, how can we rely on these self-assembling neural nets where we have no idea what the features are? Maybe they are using statistical mistakes?
Does it bother you that ML feels like a pile of hacks? So much black magic? "We don't know why, but from experimenting we've found that this works best" -- what gives? Since when are researchers tinkerers instead of preferring rigor?

I got excited. First of all, I like inspiring others. Besides, well, you know, ML bothers me too. So I postponed watching my online music class (again!) and jumped into a long explanation.

Long story short: if we do not have a mathematical proof yet for why something works, but we have rigorously verified that it does work, why wouldn't we use it? Example: I do not know the shop's opening hours, but I know that if I go there any time after work, between 6 and 8, I can buy food :)


TL;DR

Hey ***********

Happy to hear from you :)

So, the phrase we use for those coincidences is "correlation is not causation". Indeed, even if two variables are correlated (a mathematical dependency where, for example, both increase together, or one increases while the other decreases), it does not necessarily mean there is causation (one impacting the other).
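To make the distinction concrete, here is a minimal sketch on invented numbers (the classic ice-cream-and-drownings example; all values are made up): the two quantities correlate strongly only because both depend on a hidden common cause.

```python
# Ice cream sales and drownings correlate strongly, but neither causes
# the other: both are driven by a hidden common cause (temperature).
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, 1000)                    # hidden common cause
ice_cream_sales = 2.0 * temperature + rng.normal(0, 1, 1000)
drownings = 0.5 * temperature + rng.normal(0, 1, 1000)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"correlation: {r:.2f}")  # high, yet there is no causal link
```

If a model used drownings as a feature to predict ice cream sales, it would "work", for entirely non-causal reasons.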

When it comes to neural networks, it indeed becomes trickier to check what happens inside them. This is why the first piece of advice for building ML models is: know your data. Before building an NN, a researcher hopefully does the following:

1. problem framing, during which she talks to business domain experts about which features potentially impact the label;

2. data analysis and preprocessing, to investigate correlations between features and the feature distributions;

3. building a baseline, a simple linear/logistic regression, which captures only linear dependencies between the features and the label. This sets the basic model quality that the NN has to beat;

4. introducing non-linearity. If the quality of the model from step 3 is not satisfactory, a researcher may investigate further and assume there are non-linear dependencies between the features and the label. An example of a non-linear dependency: if a house is within a 30-minute walk of a bus stop, it is appealing and pricey, but if it is farther than a 30-minute walk, nobody wants it, so you have to decrease the price significantly. There are two ways to introduce non-linearity into the model. The first option is feature preprocessing: I can turn my numerical feature (walking distance from the bus stop, in minutes) into a categorical one ("less than 30 min" OR "more than 30 min") and feed this feature back into my linear model. The second option is feeding my features into a neural net. There is a theorem (the universal approximation theorem) stating that, roughly, any reasonable function can be approximated by a neural network, so we hope the NN will find all sorts of dependencies on its own.
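A tiny sketch of the first option on made-up data (the 30-minute cutoff and the prices are invented for the bus-stop example): bucketing the walking distance lets an ordinary linear fit capture the sharp price drop that the raw numeric feature hides.

```python
import numpy as np

rng = np.random.default_rng(1)
walk_minutes = rng.uniform(5, 60, 500)
# Synthetic prices: a sharp drop for houses farther than a 30-minute walk
price = np.where(walk_minutes <= 30, 300_000.0, 180_000.0)
price = price + rng.normal(0, 5_000, 500)

def r2_of_line(x, y):
    # R^2 of a one-feature least-squares line, fitted with np.polyfit
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

raw_r2 = r2_of_line(walk_minutes, price)             # numeric feature as-is
near_stop = (walk_minutes <= 30).astype(float)       # bucketed: near vs far
bucketed_r2 = r2_of_line(near_stop, price)

print(f"R^2 raw: {raw_r2:.2f}, R^2 bucketed: {bucketed_r2:.2f}")
```

The bucketed feature explains almost all of the variance, because the underlying dependency really is a step, not a line.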

However, you are right that it is almost impossible to say how exactly the features impact the label. To make sure we are doing a good prediction job, on top of the steps listed above, we check that the NN beats the quality of the linear model (otherwise, what is the point of using it?) and that it predicts on the test set (never-seen data) with roughly the same quality as on the training set. That is a good enough way to verify that the model actually learns something instead of memorizing the dataset.
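That sanity check can be sketched in a few lines on toy data (here a high-degree polynomial plays the role of an over-parameterized model): the memorizer scores beautifully on its training points but collapses on never-seen ones, while the honest fit scores similarly on both.

```python
import warnings
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 40)
y = 3.0 * x + rng.normal(0, 0.3, 40)          # linear signal plus noise

x_train, y_train = x[:30], y[:30]             # data the model learns from
x_test, y_test = x[30:], y[30:]               # never-seen data

def r2(coeffs, xs, ys):
    # R^2 of a polynomial fit evaluated on the given data
    residuals = ys - np.polyval(coeffs, xs)
    return 1.0 - residuals.var() / ys.var()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")           # silence ill-conditioning warnings
    honest = np.polyfit(x_train, y_train, 1)      # matches the true signal
    memorizer = np.polyfit(x_train, y_train, 25)  # enough freedom to memorize

print(f"honest:    train {r2(honest, x_train, y_train):.2f}, "
      f"test {r2(honest, x_test, y_test):.2f}")
print(f"memorizer: train {r2(memorizer, x_train, y_train):.2f}, "
      f"test {r2(memorizer, x_test, y_test):.2f}")
```

A large gap between the train and test scores is the red flag: the model has memorized rather than learned.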

At the same time, to my great disappointment, rare is the practitioner who does all of these steps, so practitioners are indeed at risk of misusing ML theory. Well, I hope I can help at least some of them: those who come to my ML office hours or take my ML course.

When it comes to "ML feels like a pile of hacks" - well... yes and no. The reason we get so many "rules of thumb" and "hacks" is that many more engineers than mathematicians are involved in this very young, rapidly changing science. So the pattern is: engineers find that "something works, as we suddenly discovered, but we do not know why", and then mathematicians try to prove the statement retrospectively, sometimes successfully. This is why it is important to know the source of a "rule of thumb": if it is based on a great number of observations, it won't hurt to try it. For example, there is Inception, a 22-layer neural network that does a good job of classifying images. We do not know why exactly these 22 layers with THAT number of neurons and THOSE activation functions work, but they do. So it won't hurt to test this architecture for your model too. There is no guarantee it will work, but you can set up an experiment and see.

Bottom line: if we do not have a mathematical proof yet for why something works, but we have rigorously verified that it does work, why wouldn't we use it? Example: I do not know the shop's opening hours, but I know that if I go there any time after work, between 6 and 8, I can buy food :)

Does it all make sense? Let me know :)

Thanks,

Tanya