What do we mean by inductive bias and expressiveness?
The choices we make about how models make choices
In several of my posts I have contrasted architectural decisions in machine learning, most recently in my post on Deep & Cross Networks (DCN). In machine learning, we specify the structure of a model and then execute an algorithm to find good values for all of the free parameters in that model. The structure we create defines the types of relationships that can be encoded in the resulting model, and the optimization algorithm determines which parameter values we find, conditional on the input data.
Let’s discuss four terms that recur to different degrees in papers: inductive bias, restriction bias, preference bias, and expressiveness.
Inductive bias refers to any way that a model makes choices in its learning. In Tom M. Mitchell’s framing, it is “the policy by which the learner generalizes beyond the observed training data” [2]. If you try multiple algorithms to fit a model on the same corpus, such as linear regression, decision tree models, or neural networks, typically those models will all predict different outputs for all (or at least some) data points.¹ These algorithms learn different relationships because they have different inductive biases. Otherwise they wouldn’t be different algorithms.
I quite like Mitchell’s decomposition of inductive bias into two subclasses of bias.
Restriction bias is “a categorical restriction on the set of hypotheses considered”, such as how a linear regression can only fit linear relationships between features. Linear regression models are unable to predict outputs that cannot be expressed as an additive formula with multipliers on each input.
Preference bias is “a preference for certain hypotheses over others”. Two models might be capable of representing exactly the same set of hypotheses, but because of how they learn, each has its own preference for which hypothesis it settles on. For example, linear regression with coefficient regularization will have a preference towards smaller (closer to 0) coefficients than standard linear regression, with implications for which values it will predict for any given inputs.
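A minimal sketch of preference bias in action, using scikit-learn (the data and regularization strength are arbitrary assumptions). Both models can fit the targets almost equally well, but they prefer different coefficient vectors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data: y depends only on x1, and x2 is a near-duplicate of x1,
# so many coefficient pairs fit the data almost equally well.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X_collinear = np.column_stack([x1, x2])
y = 3.0 * x1

ols = LinearRegression().fit(X_collinear, y)
ridge = Ridge(alpha=1.0).fit(X_collinear, y)

print(ols.coef_)    # may split the weight arbitrarily between x1 and x2
print(ridge.coef_)  # prefers small coefficients, roughly [1.5, 1.5]
```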
Expressiveness is often used as the inverse of restriction bias, referring to the types of relationships that are possible in a model.
To use a painfully reductive and geometric example, let’s say we wanted a model to predict how much paint we need to cover a wall, and our inputs are the length and height of the wall. For some unit of paint, the true relationship might be a fixed constant (c) times the area of the wall, such that paint = c * length * height. Say we have a corpus with no noise in features or the label. The model only needs to learn c, a single coefficient. Sounds easy, right?
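As a concrete sketch, here is such a corpus in Python (the value of c and the feature ranges are arbitrary assumptions):

```python
import numpy as np

# A noise-free corpus for the paint example; the model must recover c.
rng = np.random.default_rng(0)
c = 0.5
length = rng.uniform(1.0, 10.0, size=1000)
height = rng.uniform(1.0, 5.0, size=1000)
paint = c * length * height  # the true multiplicative relationship
X = np.column_stack([length, height])
```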
Try it with some classic models. A linear regression can learn β0 + β1 * length + β2 * height, which, after determining a fixed trio of coefficients, will not be accurate for almost any values of length and height. The restriction bias of linear regression is that it can only learn multipliers for each individual feature, meaning that linear regression is fundamentally not expressive enough to fit these multiplicative feature interactions.
A decision tree model can cut ever smaller rectangles of length and height and predict a constant amount of paint within each rectangle, which may reach decent accuracy on the corpus’s domain but won’t extrapolate outside of that range. A standard dense neural network can learn equations of the form activation(β1 * length + β2 * height), which (like linear regression) isn’t expressive enough. The neural network can learn ever better approximations through more combinations of those additive relationships, and, like decision tree-based models, with enough parameters it can reach decent accuracy on the training data domain. But like linear regression, neither of these more complicated models can exactly fit the interacted relationship.
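Continuing the corpus sketch above, here is one way to see the failure to extrapolate (the model hyperparameters are arbitrary assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=12),
    "mlp": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                        random_state=0),
}

# Test on walls larger than anything seen in training.
length_test = rng.uniform(10.0, 20.0, size=500)
height_test = rng.uniform(5.0, 10.0, size=500)
X_test = np.column_stack([length_test, height_test])
y_test = c * length_test * height_test

for name, model in models.items():
    model.fit(X, paint)
    print(name,
          "in-domain R^2:", round(model.score(X, paint), 3),
          "out-of-domain R^2:", round(model.score(X_test, y_test), 3))
```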
So this isn’t so easy after all. But it would be incredibly easy if area was a feature in the model. With area as a feature, those models are expressive enough to fit the true relationship.²
The traditional way we solve this is with domain knowledge. We know that the amount of paint needed is fundamentally related to the surface area, and we can generate area as a feature. Domain knowledge has historically been a crucial factor in model performance. When we add handcrafted features through domain knowledge, we selectively expand the model’s expressiveness. By adding area as a feature we let our model learn the importance of a single multiplicative pair, but not of any other pair of input features.
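Continuing the sketch, with the handcrafted feature the problem collapses to a single coefficient:

```python
# Domain knowledge: paint depends on the wall's area, so compute it directly.
area = (length * height).reshape(-1, 1)
lr = LinearRegression().fit(area, paint)
print(lr.coef_)  # [0.5], i.e. exactly the true constant c
```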
The most obvious approach to the problem might be to generate every possible product between two features. In our painting example, say we know length, height, and, as an arbitrary third feature, surface_is_concrete. We can create three new features: length * height, length * surface_is_concrete, and height * surface_is_concrete. Maybe length * height is the only interaction that materially matters and the other two are wasteful, but ideally an algorithm can learn that fact.
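A hand-rolled sketch of this expansion, still on the paint corpus (scikit-learn’s PolynomialFeatures with interaction_only=True does much the same thing):

```python
from itertools import combinations

import numpy as np

def pairwise_products(X, names):
    """Append every pairwise product of columns to the feature matrix."""
    cols, out_names = [X], list(names)
    for i, j in combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
        out_names.append(names[i] + "*" + names[j])
    return np.hstack(cols), out_names

surface_is_concrete = rng.integers(0, 2, size=1000).astype(float)
X3 = np.column_stack([length, height, surface_is_concrete])
X_crossed, names = pairwise_products(
    X3, ["length", "height", "surface_is_concrete"])
print(names)  # the three original features plus three pairwise products
```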
One issue is that the number of features becomes quite large. If we have k features then we have k * (k - 1) / 2 pairwise interactions; with k = 100 features, that is already 4,950 new columns. This consumes computation resources. A related issue is that this greatly reduces the restriction bias of the model, making it capable of fitting a vastly larger space of models, which we may or may not want to do.
Another issue is that, in a way, the number of features stays quite small. Multiplicative interactions are just one type of relationship between features. We could need products of more than two features. Further, an obvious extension would be to include every possible ratio between features.³ Yet this is still only the beginning of expanding expressiveness, in an infinity of possible relationships. By picking multiplicative interactions we decide that those are the relationships we want to find, and that other potential relationships are ones we want to omit. This is still inductive bias, of the modeler rather than of the model structure or the optimizer.
Case study: Deep & Cross Networks
This interaction problem is exactly what DCNs aim to solve: finding multiplicative interactions (“crosses”) automatically within an ML model. We’re on a long-term march towards reducing how much domain knowledge is needed for effective ML, and part of that march is having neural networks learn feature crosses on their own.
DCNs do this by internally calculating every feature cross (twice, even, plus self-crossed terms), and then learning and combining weights on each of them. This does create k² coefficients, but it’s a good combination of increasing the expressiveness of the model such that it can find any interaction (and even find complex polynomials at higher levels), while still forcing the model to condense those interactions into fewer compound features internally.
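A minimal numpy sketch of the DCN-V2-style cross layer from [4] (the weights are random here, where a trained model would learn them, and each layer would have its own W and b):

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """One cross layer: x_{l+1} = x0 * (W @ xl + b) + xl.

    The elementwise product with the original input x0 is what creates
    the feature crosses; stacking l layers reaches polynomial
    interactions up to degree l + 1.
    """
    return x0 * (W @ xl + b) + xl

k = 4  # feature dimension (arbitrary)
rng = np.random.default_rng(0)
x0 = rng.normal(size=k)      # the input features (post-embedding)
W = rng.normal(size=(k, k))  # learned in practice
b = np.zeros(k)

x1 = cross_layer(x0, x0, W, b)  # crosses x0 with itself: degree-2 terms
x2 = cross_layer(x0, x1, W, b)  # second layer reaches degree-3 terms
```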
The Google use case for DCNs is for categorical or binary features, and the authors’ case studies of the Criteo and MovieLens datasets also contain categorical features [3, 4]. I wonder if that is important to their case: whether standard multilayer perceptron (MLP) neural networks make more sense with continuous features and lose a lot of value when the features are binary. I’m not sure the authors tested any benchmarks where continuous features were dominant. MLPs can learn linear relationships and complex combinations of such, but perhaps that expressiveness is a waste when multiplicative relationships are what we need.
Analyzing categorical versus continuous features is complicated by how the DCN authors first use a layer of (concatenated) embeddings, to shrink the categorical features into fewer values. This is a layer of within-feature compression before we get to our cross layers. The authors don’t strictly test binary features or one-hot encoded categorical features; instead the embedding layer permits the model to learn a vector space for each categorical feature.
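A sketch of that embedding step (the vocabulary sizes and embedding dimensions are arbitrary assumptions):

```python
import numpy as np

# Each categorical feature gets its own embedding table; the looked-up
# vectors are concatenated to form the input to the cross layers.
vocab_sizes = [1000, 50]  # cardinalities of two categorical features
embed_dims = [8, 4]
rng = np.random.default_rng(0)
tables = [rng.normal(size=(v, d)) for v, d in zip(vocab_sizes, embed_dims)]

feature_ids = [123, 7]  # one example's category indices
x0 = np.concatenate([table[i] for table, i in zip(tables, feature_ids)])
print(x0.shape)  # (12,): a dense vector rather than one-hot columns
```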
But regardless, let’s take a minute to think about models of binary features. We can’t do anything interesting relying on a continuous range of values, because there isn’t one. Instead we can only have univariate binary effects, and effects on various AND and NOT interactions between the binary values. A regular dense neural network is fairly expressive here. Consider the first hidden layer. Each output neuron can express an AND between any subset of features, from any pair all the way up to the entire set of features. We immediately get full order interaction depth. We can also easily express a NOT. We can combine AND with univariate NOTs. With 128 or 512 neurons, for example, we can express a large number of these binary conditions. In the next hidden layer we can reach complex NOT(AND()) combinations.
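To make this concrete, here is how a single ReLU neuron can realize those conditions, with weights chosen by hand where a trained network would have to learn them:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# AND over any subset of binary inputs: weight 1 on each chosen input
# and bias -(n - 1), so the neuron fires only when all n inputs are 1.
def and_neuron(x, idx):
    return relu(x[idx].sum() - (len(idx) - 1))

# NOT of one input: weight -1 and bias 1.
def not_neuron(x, i):
    return relu(1.0 - x[i])

x = np.array([1.0, 1.0, 0.0, 1.0])
print(and_neuron(x, [0, 1, 3]))  # 1.0: inputs 0, 1, and 3 are all on
print(and_neuron(x, [0, 2]))     # 0.0: input 2 is off
print(not_neuron(x, 2))          # 1.0
```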
Why would a cross network be better than a dense MLP network? I am genuinely unsure on this point, and would be interested in hearing any hypotheses. Two come to mind:
- The dense network has no prior preference for lower-order interactions over higher-order interactions. It has to learn those relationships from fitting the data, whereas with cross layers we directly restrict to lower-order interactions. Perhaps the standard deep neural networks (DNNs) could learn this preference, but need much more data: MovieLens is only 740k rows, and Criteo is bigger at 45M, although the Google use case is large, with 100B rows.
- Does it take too many layers to get to useful NOT(AND()) relationships? Maybe the DNN needs to be deeper rather than wider. I’m not sure the authors evaluated changing the depth.
It’s still a bit of a mystery to me why DNNs would struggle in binary models. But I wouldn’t expect them to be better than DCNs, either. I do think our models should have inductive bias in favour of lower-order interactions. Where I think DNNs would add a lot of value is with more complex continuous features. The DCN authors had a situation where their individual input features lacked internal variability to exploit in modeling, so it became important to model interactions between the features.
Seen through the eyes of history
Recent developments in modeling can be viewed through their choices and implications on inductive bias. In the last decade we have seen neural networks grow immensely in capability and popularity. We have moved towards much larger and more expressive models, enabled by hardware and software advances.
But the expressiveness isn’t uniform in all directions:
- Our models still have a heavy emphasis on linear relationships through matrix multiplication, stacked to create complex interactions.
- We saw the rise of recurrent neural networks (RNNs). But then they fell out of favour, in some cases to convolutional substitutes (CNNs), and then more often to transformer models.
- The types of transformations used are heavily tailored to the problem domain, such as convolutions with their locality and translation invariance. Then the same type of transformation is repeated ad nauseam. We stack many convolutional layers on top of each other, or many transformers on top of each other, but it is less common to interleave both.
Even with expanded technical capabilities, it is still worth our while to tailor our models to the problem domain. Ultimately we have to make choices about the inductive biases in our models. As Mitchell wrote, “A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances” [2]. Our models have to make choices about what to predict for new combinations of feature values—or even, for that matter, for previously observed combinations of feature values. Our predictions may or may not be right, and usually are at least wrong in some stochastic sense. We have to make choices about how our models should behave. These choices do not have right or wrong answers, so much as they have rational or ill-suited hypotheses about the world.
[1] Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V.F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gülçehre, Ç., Song, H.F., Ballard, A.J., Gilmer, J., Dahl, G.E., Vaswani, A., Allen, K.R., Nash, C., Langston, V., Dyer, C., Heess, N.M., Wierstra, D., Kohli, P., Botvinick, M.M., Vinyals, O., Li, Y., & Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks. ArXiv, abs/1806.01261.
[2] Mitchell, T.M. (1997). Machine learning, International Edition. McGraw-Hill Series in Computer Science.
[3] Wang, R., Fu, B., Fu, G., & Wang, M. (2017). Deep & Cross Network for Ad Click Predictions. Proceedings of the ADKDD'17.
[4] Wang, R., Shivanna, R., Cheng, D.Z., Jain, S., Lin, D., Hong, L., & Chi, E.H. (2021). DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. Proceedings of the Web Conference 2021.
¹ On a continuous scale, that is. Two classifiers may both classify a data point as a positive, but they will have different probabilities.
² The decision tree models can’t literally find the multiplicative relationship between area and c the way linear regression or a neural network could, but with enough parameters and data they can approximate that curve.
³ Which is a whole other topic that I will refrain from diverting into now.