
History of Econometrics

Individual Heterogeneity and State Dependence: From George Biddell Airy to James Joseph Heckman

Hétérogénéité individuelle et dépendance d'état : De George Biddell Airy à James Joseph Heckman
Marc Nerlove
p. 281-320


In this paper, I examine the evolution of an idea: that the differences between statistical relationships and the observations on which they are based, that is, the residuals, may have meaning and structure, and that they must be taken into account despite the fact that they arise from unobservables. When working with microeconomic data, the principal unobservables arise from heterogeneity, which may stem in part from path dependence. I reach a disturbing conclusion, namely that unobserved individual behavior due to past experience cannot be separated from other kinds of unobserved heterogeneity. This implies, in particular, that we cannot obtain economically meaningful estimates of distributed lags or of other models of path dependence without making ad hoc assumptions devoid of economic content.


In this paper I examine the evolution of an idea: that differences between statistical relationships and the observations on which they are based, or disturbances, may have meaning and structure and that such meaning and structure must be taken into account despite the fact that they arise from unobservables. If they are not, our understanding of what we can observe may be flawed.

This paper is not about how we represent individual heterogeneity, whether with fixed or with random effects, but about the need to take account of individual heterogeneity when estimating behavioral relationships with microeconomic or social data. My disturbing conclusion is that we cannot separate individual unobserved behavior due to past experience from other types of unobserved heterogeneity. This means, in particular, that we cannot obtain economically meaningful estimates of distributed lags or other models of path dependence without making ad hoc assumptions having no economic content. In order to reach this conclusion, I look closely at how the disturbance in statistical, and eventually econometric, relationships has been interpreted by those who first used such relationships and how they came to understand the importance of individual heterogeneity. Whether such heterogeneity was modeled with parametric fixed effects or with components of variance is not the issue. What is at issue is that economic behavior is inherently dynamic. Even when microeconomic data, such as panel or longitudinal data, are available to support models of dynamic behavior, it is not possible to identify models of path dependence which separate out other sources of heterogeneity among individuals.

I begin with a few stories from the history of statistics to show how the notion of disturbances arose, primarily in the analysis of astronomical data. I continue with detailed accounts of the work of George Biddell Airy and of R. A. Fisher's work on the so-called intra-class correlation. I follow with a brief discussion of the work of Henry Daniels and Churchill Eisenhart, which makes clear the distinction between fixed and random effects in the disturbances for dealing with individual heterogeneity, a distinction that had been somewhat muddied by Fisher. Developments in the statistical literature related to giving content to the disturbances are referenced. Finally, the treatment of the residual in econometrics by Haavelmo, Marschak and Hildreth is discussed, with brief digressions on the work of Mundlak, Hochman, Balestra and Nerlove and on the debate over the appropriateness of fixed versus random effects in the analysis of microeconomic and social data. I conclude with the work of James Heckman on heterogeneity and state dependence. Heckman, more than anyone, has pioneered the application of the idea that relationships based upon microeconomic or social data (panel data, longitudinal data, and event histories) depend on unobservables which have structure, such as path or state dependence, or which otherwise reflect individual heterogeneity. The problem is, as Heckman makes clear, that state dependence and heterogeneity are generally not separately identifiable. This means, in particular, that, for non-experimental micro data, estimation of dynamic relationships designed to capture state dependence is problematic.

1. Some Prehistory: Gauss, Laplace and Statistics in Astronomy

The history of statistical ideas is both fascinating and useful. The problem of scientific inference is to draw conclusions about the process which has (or may have) generated a body of data (the data generating process, or DGP). Often the person who seeks such conclusions is himself a part of the process which has generated the data, as in the experimental sciences or in the collection of survey data; more often in econometrics the analyst has not been part of the DGP, which may involve data collected for regulatory purposes or interpolated from censuses. The question of what constitutes a satisfactory method of inference has had a long and somewhat tortured history from its early beginnings in the eighteenth century to the present day, and is still not fully resolved. It is useful to review this fascinating story briefly and selectively, both because older ideas about inference continue to offer insight to present-day practitioners and because the confusions and gropings toward solutions of the past are a lesson to those who think they now know all the answers. In this selective review, I am particularly concerned with the relation between the statistical model of the DGP and the data generated by the DGP, that is to say, in commonplace terms: Why doesn't the model fit exactly? What are the reasons for the discrepancies, the "errors", the "residuals" or the "disturbances"? Sometimes what we can observe is not the whole story, or even the most important part of the story.

  • 1 Lehmann (1993) gives a nice summary of the ideas of Fisher and Neyman and Egon S. Pearson about tes (...)

First, how did the prevailing paradigm of independently, normally distributed disturbances arise in statistics in the work of Fisher and of Neyman and Pearson, and how was it carried over into the Cowles paradigm of econometric inference by Haavelmo and Marschak?1

  • 2 J. Bernoulli (1713) attempted to give a formal basis for the commonsense notion that the greater th (...)
  • 3 The term inverse probability is not used by Laplace (1774) but appears, according to Edwards (1997, (...)

I begin in the eighteenth century. By the end of the seventeenth century the mathematics of permutations and combinations had been extensively developed in conjunction with the analysis of games of chance by such mathematical luminaries as Fermat, Pascal, Huygens, Leibniz and Jacob Bernoulli (one of twelve of that illustrious mathematical family). The origins of the theory of probability in the study of games of chance are of some significance in two respects: First, it is natural to construct the “events” or outcomes as composed of equally probable cases. Second, focus on the stochastic nature of the outcomes themselves tends to obscure, or at least divert attention from, the mechanism generating the data or observations, since, in the case of games of chance, such mechanisms are in general perfectly known. With respect to the latter, Jacob Bernoulli’s 1713 treatise, Ars Conjectandi, changed the focus from observed outcomes to the underlying “causes” or DGP.2 The former, composition of complex outcomes in terms of equally likely elementary events, translates into equally likely or uniformly distributed causes in application of Bayes’ Theorem (1764) with a uniform prior. And Bayes’ Theorem with a uniform prior is nothing more or less than Laplace’s (1774) independent development of the principle of inference by inverse probability.3 The assumption of uniform priors in Bayesian analysis was later to be called the principle of insufficient reason and leads to serious conceptual difficulties and paradoxes, in large part responsible for Fisher’s (1922, 1932) rejection of inverse probability inference.

Until Fisher (1922) there was not a sharp distinction between the likelihood of a series of observations and the posterior probability of those observations with non-informative prior, i.e., their inverse probability, but the question of how to use the likelihood or inverse probability to draw an inference about their DGP was already a subject of investigation in the eighteenth century. The idea of combining several measurements (plausibly independent) of the same quantity by taking their arithmetic mean had appeared at the end of the seventeenth century. But was this the best way of combining the information in the several observations to obtain a more accurate measurement of the quantity in question than that in any one of them? Good historical accounts of the problem of combining observations and its relation to the development of least squares by Legendre in 1805 and Gauss in 1809 are given in Plackett (1972) and Stigler (1986, Chapter 1). Simpson (1755) took the first crucial step forward by focusing not on the observations themselves but on the errors of measurement. That freed Laplace (1774) and Daniel Bernoulli (1777) to concentrate on the properties that the distribution of the errors of measurement ought to have. Bernoulli then argued that the “true” value of the magnitude being measured ought to be taken as the value which maximizes the probability of the whole set of observations with such a distribution of errors. This is clearly what we call today the method of maximum likelihood. Unfortunately, he made a poor, one might even say truly awful, choice of error distribution and came up not with the mean but something quite different even for two or three observations. And solving the problem with his choice of error distribution with more than three observations is horrendous. Still, Daniel Bernoulli usually gets the credit for having invented the method of maximum likelihood.

Thinking of the data as generated by observations not of the “true” magnitudes of variables of interest but as those values generated with error led statistical development in an unfortunate direction for later use in the analysis of social and economic data. It was not until the end of the century and the beginning of the next, with the work of Galton, Karl Pearson and Fisher, that a more nuanced approach appeared. The notion of errors of observation, however, played a fundamental role in the further development of statistics and its relation to astronomical observation, and continues to do so. Airy’s work, discussed in the next section of this paper, continues the focus on errors of observation, albeit with an important twist.

  • 4 De Moivre, of course, had it as early as 1733 in an unpublished memorandum later published as part (...)

Returning to the focus on errors of observation: Laplace also stumbled over the form of the error distribution (see Stigler, 1986, 105-122). Resolution of the matter had to wait until Gauss (1809) reinvented the normal distribution.4 The principal subject of Gauss’s book was an investigation of the mathematics of planetary orbits, but at the very end he added a section on the combination of observations. Gauss considered a somewhat more general problem than the estimation of one “true” value from a series of measurements subject to error: The estimation of a linear equation with constant coefficients, known independent variables, and observations on the dependent variable subject to error, εi. Assuming the errors to be independent of one another and distributed with unknown distribution, φ(εi), Gauss proposed to estimate the unknown coefficients by maximizing L = Π φ(εi). So far nothing beyond what Laplace and Bernoulli had done in the case of simple measurements, but now Gauss took a new and inspired direction: Rather than starting with the principle of insufficient reason, imposing some properties the distribution of errors ought to have (such as symmetry about zero and lesser probabilities for large errors than for small ones), then arriving at a suitable form for φ, Gauss reversed the procedure. Instead of imposing further conditions directly as Laplace and others had done, he assumed the conclusion! He reasoned that in the case of a single value measured numerous times with error, dozens of eminent and not so eminent astronomers couldn’t be wrong: the arithmetic mean must be the correct answer. He then proved in this case that the arithmetic mean of the observations maximizes L only when

φ(ε) = (h/√π) exp(−h²ε²)

for some positive constant h, which could be interpreted as a measure of the precision of observation. Extending this result to the more general case of a linear equation with unknown coefficients and known explanatory variables, dependent variable subject to errors of measurement, Gauss then showed that maximizing L yields the least squares estimates of the unknown coefficients. In this way, the normal distribution and least squares were connected. The circularity and boldness of this argument are breath-taking. And why should a distribution which solves one special case, a single value measured several times over with error, generalize to a much more complex problem?
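Gauss's claim can be checked numerically. The sketch below (my own illustration, not drawn from the sources discussed) maximizes the likelihood L = Π φ(εi), with φ the normal form above, over a grid of candidate values for a handful of simulated measurements, and confirms that the maximizer is the arithmetic mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=7)          # seven "measurements" of one quantity

def log_likelihood(m, y, h=1.0):
    # log of prod_i (h/sqrt(pi)) * exp(-h^2 (y_i - m)^2)
    return np.sum(np.log(h / np.sqrt(np.pi)) - h**2 * (y - m) ** 2)

# brute-force maximization over a fine grid between the smallest and largest observation
grid = np.linspace(y.min(), y.max(), 100001)
m_hat = grid[np.argmax([log_likelihood(m, y) for m in grid])]

print(m_hat, y.mean())   # the two agree to grid precision
```

The choice of the precision constant h does not matter here: it rescales the log-likelihood without moving its maximizer.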

  • 5 It interesting to note that Marschak (1953) uses exactly the same justification for supposing that (...)

Laplace had just the right justification for the distribution of errors which Gauss had shown led to least squares: if the errors were caused by numerous insignificant causes, each negligible in effect, but summed together, then they ought to be distributed more or less normally.5 He rushed into print with a supplement to his memoir. Laplace (1812, 1820) went even further. He showed that all estimates of the coefficients which are linear functions of the independent variables are approximately normally distributed and that, within this class, the ordinary least squares (OLS) estimates have the smallest expected squared error. He further derived the multivariate normal as the limiting distribution of two or more least squares estimates. Perhaps piqued by Laplace, Gauss (1823) had second thoughts concerning his “derivation” of OLS as the “maximum likelihood” solution from a normal distribution of errors. He noted that the analysis really depended on second moments and that if one was content to measure the accuracy of estimates which were linear functions of the observations by their expected squared error then his conclusion held without regard to the distribution of the errors (as long, of course, as they were i.i.d. (independent and identically distributed) with zero mean and common variance). He freed the method of least squares from the assumption of normally distributed errors and thus from Laplace’s asymptotic justification, producing what we call today the Gauss-Markov Theorem.
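A small Monte Carlo sketch of the Gauss-Markov point (an illustration of mine, with made-up numbers): with i.i.d. but decidedly non-normal errors, the OLS slope still has smaller mean squared error than a competing linear unbiased estimator, here a two-point estimator through the first and last observations.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
alpha, beta = 1.0, 2.0

ols_err, twopoint_err = [], []
for _ in range(5000):
    eps = rng.uniform(-1.0, 1.0, size=x.size)     # i.i.d., zero mean, NOT normal
    y = alpha + beta * x + eps
    # OLS slope: cov(x, y) / var(x)
    b_ols = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    # another linear unbiased estimator: slope through the endpoints
    b_two = (y[-1] - y[0]) / (x[-1] - x[0])
    ols_err.append((b_ols - beta) ** 2)
    twopoint_err.append((b_two - beta) ** 2)

print(np.mean(ols_err), np.mean(twopoint_err))    # OLS wins
```

Both estimators are linear in the observations and unbiased; the theorem says only that OLS minimizes expected squared error within that class, which is exactly what the simulation exhibits.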

The Gauss-Markov Theorem, OLS and the Laplacian justification of the normality of errors serve as the work horses of statistical inference and particularly of econometric inference until this very day. But a major chink in the armor of this monolithic approach began to appear in astronomical investigation in the middle of the nineteenth century in the work of Airy, whom I discuss next.

2. George Biddell Airy (1801-1892): Random Effects in the Analysis of Astronomical Data

George Biddell Airy was born July 27, 1801 at Alnwick, Northumberland, England, and died six months short of his ninety-first birthday at Greenwich, England, on January 2, 1892. He went up to Cambridge to study mathematics, becoming successively Senior Wrangler in 1823, Fellow of Trinity College in 1824, and Lucasian Professor of Mathematics in 1826, a professorship once held by Isaac Newton. He was appointed Astronomer Royal of England and Director of the Royal Observatory at Greenwich in 1835, a post he held until 1881. He was knighted by Queen Victoria in 1872. Although the Greenwich meridian, longitude 0°, had been used by seafarers since 1767 as a reference point for time and longitude (Sobel, 1995, 166), it was Airy’s precise measurement of the location of the meridian by means of an instrument he invented which made it the universal standard.

Until Airy wrote his monograph (1861) on the transit of Jupiter, astronomy and all statistical applications were dominated by the Gauss-Laplace formulation, in which independent errors of observation were assumed; the main problem was how these errors, assumed at a minimum to be i.i.d., were distributed. Airy now changed the focus away from independent errors of observation.

  • 6 Reference to, and a brief discussion of Airy’s monograph, are to be found in Scheffé (1956), who cr (...)

He makes explicit use of a variance-component model for the analysis of astronomical panel data.6 Here is how Airy (1861, 92) puts the problem. Note that what Airy calls a Constant Error we would call a random day-effect:

When successive series of observations are made, day after day, of the same measurable quantity, which is either invariable...or admits of being reduced by calculation to an invariable quantity...; and when every known instrumental correction has been applied (as for zero, for effect of temperature upon the scale, etc.); still it will sometimes be found that the result obtained on one day differs from the result obtained on another day by a larger quantity than could have been anticipated, the idea then presents itself, that possibly there has been on some one day or on every day, some cause, special to the day, which has produced a Constant Error in the measures of that day. It is our business now to consider the evidence for, and the treatment of, such constant error.

Continuing (93-94), Airy writes:

First, it ought, in general, to be established that there is possibility of error, constant on one day but varying from day to day. ...suppose...that we have measured the apparent diameter of Jupiter. It is evident that both atmospheric and personal circumstances may sensibly alter the measure; and here we may admit the possibility of the error. ....Now let us take the observations of each day separately, and...investigate from each separate day the probable error of a single measure. We may expect to find different values (the mere paucity of observations will sufficiently explain the difference); but as the different observations on the different days either are equally good, or (as well as we can judge) have such a difference in merit that we can approximately assign the proportion of their probable errors, we can define the value of error for observations of the standard quality as determined from the observations of each day; and combining these with greater weight for the deductions from the more numerous observations, we shall have the final value of the probable error of each observation not containing the effects of the Constant Error.

Airy goes on, on subsequent pages, to develop verbally the following model: Let us observe the phenomenon, say the apparent diameter of Jupiter, on I nights, with Ji observations being made the ith night. Let the measurement be yij; then

(1) yij = μ + δi + εij

where μ is the “true” value, and {δi} and {εij} are random effects with the following interpretation: δi is what Airy calls the Constant Error associated with day i, what we would call the “day effect,” i.e., the atmospheric and personal circumstances peculiar to the ith night, and εij is all the rest, or the errors about the conditional mean, μ + δi, on the ith night. He assumes that the εij and δi are each independently and identically distributed and independent of each other and have zero means. Let the variances of δ and of ε be, respectively, σ²δ and σ²ε, and suppose, for simplicity, equal numbers of observations J each night (a balanced panel). To make his point, Airy wants to reject the hypothesis that σ²δ = 0. He computes an estimate of the “within” variance for each night i as

si² = Σj (yij − ȳi)² / (J − 1), where ȳi is the mean of the observations on night i.

Airy then takes the arithmetic mean of the square roots to estimate the square root of σ²ε:

s̄ = (1/I) Σi si.

To estimate σ²δ Airy uses not the between-nights sum of squares but the corresponding mean absolute deviation:

d = (1/I) Σi |ȳi − ȳ|, where ȳ is the grand mean.

He then calculates an approximate probable error for d from a standardized normal by replacing σε by s̄ and μ by ȳ. The calculated value of d being larger than this value, Airy rejects the hypothesis of no night effect. Although the details of Airy’s analysis seem a bit clumsy from a modern point of view, the spirit of his model and calculations is the same as that of Balestra and Nerlove (1966) in their justification and use of a variance-components, random-effects model.
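Airy's procedure can be mimicked on simulated data. The sketch below (my own reconstruction in modern terms; the constants are illustrative) generates a balanced panel from the model (1), then computes the within-night estimate s̄ and the between-night mean absolute deviation d:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J = 10, 8                        # I nights, J observations per night
mu, sigma_d, sigma_e = 40.0, 0.5, 1.0

delta = rng.normal(0.0, sigma_d, size=I)              # "Constant Error" of each night
y = mu + delta[:, None] + rng.normal(0.0, sigma_e, size=(I, J))

night_means = y.mean(axis=1)
# within-night scatter, pooled as Airy did: average the nightly estimates
s_i = y.std(axis=1, ddof=1)
s_within = s_i.mean()                                  # rough estimate of sigma_e
# between-night spread via the mean absolute deviation of the nightly means
d = np.abs(night_means - night_means.mean()).mean()

print(s_within, d)
```

If σ²δ were zero, the nightly means would scatter only on the scale of σε/√J; a d well above that scale is Airy's evidence for a night effect.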

  • 7 William Chauvenet (1820-1870) was professor of mathematics at the U.S. Naval Academy in Annapolis f (...)

Only a few years later William Chauvenet (1863) published the first edition of his two-volume text on spherical astronomy, which became the standard reference work until the end of the century.7 His calculations of the probable error of transit observations (1863; fifth edition, 1889, 194-200) use the estimate

probable error of the grand mean = 0.6745 √( σ²δ/I + σ²ε/(IJ) ).

  • 8 Hald (1998, 675) mentions two additional precursors of Fisher: Edgeworth (1885) and Thiele (1903).

This formula is exactly the one used in a random-effects model for the analysis of panel data. Thus the idea of variance components was well established long before Fisher, who is discussed in the next section, wrote about the intraclass correlation in 1925. Indeed, Francis Galton (1889) introduced the concept, although under another name, and used a variance-components model in his work on human inheritance and his anthropometric investigations (see Stigler, 1999, 182).8

3. Ronald Aylmer Fisher (1890-1962): Fixed Effects and Random Effects

  • 9 See the discussion of Hoch (1954) and Mundlak (1961) below.

Fisher’s work established variance components, or random-effects models, as a method of allowing individual heterogeneity to play a role in the analysis of biometric data. But he also invented ANOVA tables, that is, fixed-effects models, to allow for individual heterogeneity in agronomic data. This work represents another approach to individual heterogeneity, the fixed-effects approach, as it is generally called in the econometric literature.9

Fisher was born in London in 1890 and died in Adelaide, Australia, in 1962. A detailed biography of his life and work has been written by his daughter, Joan Fisher Box (1978). Much of Fisher’s work was concerned with agronomic research at Rothamsted.

  • 10 If there are Q variables there are, in general, (...)
  • 11 This was not always so. Fisher’s battles with the experimental establishment to introduce randomiza (...)

Suppose we are evaluating two varieties of high-yielding rice. We want to know how each variety responds to fertilizer application and to water availability, so we design an experiment in which each variety is planted several times over and is subjected to various determined and accurately measured levels of fertilizer and water application. At the end of the day, we measure the yield of each variety on each plot and for each combination of fertilizer and water application. If we have designed the experiment well, varieties are allocated to plots and treatments in a random manner. Clearly, there are a great many unobserved factors affecting the yields of each variety observed besides water availability and level of fertilizer application, most of which have to do with the particular plot. Suppose that we distinguish three levels of fertilizer application: low, medium, and high, and three levels of water application: low, medium, and high. The standard fixed-effects ANOVA model consists of an over-all mean, a main effect for each of the factors: variety, fertilizer and water, represented respectively by one, two and two parameters, three bivariate interaction effects, and one trivariate interaction.10 The treatment levels and varieties can be represented by dummy variables with appropriate restrictions, so that this ANOVA problem can be treated as a regression problem in which rice yield is the dependent variable and the observed independent variables are the dummies and various products thereof. The disturbance is assumed to be a random variable, independent of variety and treatment levels, which represents all the left-out variables associated with plot. This is the kind of problem Fisher (1925) considered in detail. The important thing to note is that variety and fertilizer and water treatment levels are fixed by the experimenter; there is no thought that they might have been selected from a larger, possibly unknown, population of varieties or levels.
On the other hand, the plot effects can be considered random draws from an unknown population of unobserved plot-specific factors. In an experimental context, these effects are “controlled” by randomization.11
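The ANOVA-as-regression idea can be sketched for the rice example. In the illustration below (the numbers and effect sizes are invented), each variety-fertilizer-water cell is replicated, main effects are coded as dummy variables with one level of each factor dropped, and the coefficients are recovered by OLS:

```python
import numpy as np

rng = np.random.default_rng(3)
# 2 varieties x 3 fertilizer levels x 3 water levels
cells = [(v, f, w) for v in range(2) for f in range(3) for w in range(3)]

rows, y = [], []
for v, f, w in cells:
    for _ in range(4):                  # 4 replicate plots per cell
        # design row: intercept, variety dummy, fert-med, fert-high, water-med, water-high
        rows.append([1.0, v, f == 1, f == 2, w == 1, w == 2])
        # invented true effects plus a plot-level disturbance
        y.append(10.0 + 1.5 * v + 0.8 * (f >= 1) + 0.5 * (w >= 1)
                 + rng.normal(0.0, 0.3))

X = np.array(rows, dtype=float)
y = np.array(y)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS on the dummy design
print(coef.round(2))   # intercept, variety, fert-med, fert-high, water-med, water-high
```

Only main effects are included here; the interaction terms mentioned in the text would be products of these dummy columns, added to the design matrix in the same way.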

  • 12 See also Moran and Smith (1966). Fisher (1918b) was the paper submitted first to Biometrika that Ka (...)

The terms variance and Analysis of Variance were both introduced by R. A. Fisher in his famous and seminal papers on quantitative genetics (1918a) and (1918b).12 The concepts and methods of both fixed-effects and random-effects models were elaborated greatly in Fisher (1925), especially in Chapters 7 and 8, “Intraclass Correlations and the Analysis of Variance,” and “Further Applications of the Analysis of Variance.” But Fisher was never clear on the distinction between the fixed-effects model and the random-effects model. In section 40 of Chapter 7 (page references are to the 1970 reprint of the 14th edition), Fisher (1925, reprinted 1970, 234) writes the usual ANOVA table for assessing the significance of the variation of the heights of brothers from the same family across families, i.e., the table appropriate for the question: Can the family “effect” account for a significant part of the total variation in heights? He then goes on to interpret the problem in terms of the proportion of variance attributable to the “family effect,” with a clear “random-effect” flavor (225-226):

Let a quantity be made up of two parts, each normally and independently distributed; let the variance of the first part be A, and that of the second part B; then it is easy to see that the variance of the total quantity is A + B. Consider a sample of n’ values of the first part, and to each of these add a sample of k values of the second part, taking a fresh sample of k in each case. We then have n’ families of values with k in each family. In the infinite population from which these are drawn [italics supplied] the correlation between pairs of members of the same family will be

A / (A + B).

From such a set of kn’ values we make estimates of the values of A and B, or in other words we may analyse the variance into the proportions contributed by the two causes; the intraclass correlation will be merely the fraction of the total variance due to the cause which observations in the same family have in common [italics supplied]. The value of B may be estimated directly, for the variation due to this cause alone, consequently [note that Fisher uses S where we would use Σ and doesn’t bother with expectation operators]

n′(k − 1) B = S(x − x̄p)², where x̄p is the mean of family p.

The mean of the observations in any family is made up of two parts, the first part with variance A, and a second part, which is the mean of k values of the second parts of the individual values, and therefore a variance B/k; consequently from the observed variation of the means of the families, we have

A + B/k = S(x̄p − x̄)² / (n′ − 1).

  • 13 Scheffé (1956), Anderson (1978), and Searle, Casella and McCulloch (1992, Chapter 2), give more det (...)

Although Fisher may have been perfectly clear in his own mind what the distinction between fixed effects and random effects was, by eschewing the use of the expectation operator and working from a standard ANOVA table but giving it a population interpretation appropriate to a random-effects model, he greatly muddied the waters for those who followed.13
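Fisher's estimates of A and B can be reproduced on simulated family data (the simulation and its constants are my own illustration; the estimators follow the formulas in the quoted passage, written with explicit divisors):

```python
import numpy as np

rng = np.random.default_rng(4)
n_fam, k = 200, 5                  # n' families, k members each
A, B = 2.0, 3.0                    # between-family and within-family variances

fam = rng.normal(0.0, np.sqrt(A), size=n_fam)
x = fam[:, None] + rng.normal(0.0, np.sqrt(B), size=(n_fam, k))

fam_means = x.mean(axis=1)
# within-family variation estimates B directly
B_hat = ((x - fam_means[:, None]) ** 2).sum() / (n_fam * (k - 1))
# the family means have variance A + B/k, so subtract the within part
A_hat = fam_means.var(ddof=1) - B_hat / k
rho = A_hat / (A_hat + B_hat)      # intraclass correlation

print(round(rho, 2))               # should come out near A/(A+B) = 0.4
```

The last line is exactly Fisher's point: the intraclass correlation is the fraction of the total variance due to the cause that members of the same family share.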

4. Henry Ellis Daniels (1912-2000) and Churchill Eisenhart (1913-1994): Fixed versus Random Effects

The distinction between random effects and fixed effects, and its importance for the analysis of non-experimental versus experimental data, was not clarified until Churchill Eisenhart’s “survey” in 1947. Although Daniels (1939) had it substantially right, his paper was largely overlooked until much later.

Daniels (1939, 190) puts the matter as follows:

The requirements of the particular problem decide whether the systematic or random interpretation is relevant. From the machine-maker’s point of view, for instance, the k units might be more properly thought of as a sample taken at random from all possible units. But when the aim is to measure and reduce variation in the output of a given plant, the systematic interpretation appears to be the correct one. Cases may, of course, arise where a combination of the two types of factor is operating...

Eisenhart (1947, 19-20) is much clearer but less succinct; he writes,

In practical work a question that often arises is: which model is appropriate to the present instance — Model I [fixed effects] or model II [random effects]? Basically, of course, the answer is clear as soon as a decision is reached on whether the parameters of interest specify fixed relations or components of random variation. The answer depends in part, however, on how the observations were obtained; on the extent to which the experimental procedure employed sampled the respective variables at random. This generally provides the clue. For instance, when an experimenter selects two or more treatments or two or more varieties, for testing, he rarely, if ever, draws them at random from a population of possible treatments or varieties; he selects those that he believes are most promising. Accordingly, Model I is generally appropriate where treatment, or variety comparisons are involved. On the other hand, when an experimenter selects a sample of animals from a herd or a species, for the study of the effects of various treatments, he can insure that they are random. ...But he may consider such a sample to be a random sample from the species, only by making the assumption that the herd itself is a random sample from the species. In such a case, if several herds (from the same species) are involved, Model II would clearly be appropriate with respect to variation among the animals from each of the respective herds, and might be appropriate with respect to the variation of the herds from one another…The most difficult decisions are usually associated with places and times: Are the fields on which the tests were conducted a random sample of the county, or of the state, etc.? Are the years in which the tests were conducted a random sample of years?

Eisenhart is raising here a rather deep problem in the theory of probability. Whether a particular variable can be considered a random draw from some population is, in principle, decidable by applying the principle of “exchangeability” introduced by de Finetti (1930); see also de Finetti (1970; trans. 1990, vol. 2, 211-224) and Lindley and Novick (1981). In a nutshell, the idea, very Bayesian in flavor, is to ask whether we can exchange two elements in a sample and still maintain the same subjective distribution: Thus, in a panel study of households, are any two households in the sample exchangeable without affecting the distribution from which we imagine household effects to be drawn? In a panel of state data, are California and Maryland exchangeable without affecting the subjective distribution of the state effects? It’s a dicey question in many cases.

5. Further Developments in the Statistical Literature

  • 14 See especially Rao (1946). Rao (1971a, b) works out details of the MINQUE alternative to maximum li (...)
  • 15 As, for example, in the ANOVA interpretation of log-linear probability models for the analysis of c (...)
  • 16 Anderson and Bancroft (1952, 313-377) have five chapters (22-25) on variance-components models and (...)

30The ANOVA table, which is the heart of the analysis of variance method proposed by Fisher (1925), is simply a way of arranging a series of calculations involving the partitioning of sums of squares of observations. What one does next depends, as suggested in the preceding discussion, on the context. For the greater part of agricultural and industrial research, which involves the statistical analysis of experimental data, the fixed-effects model is appropriate. Moreover, the fixed-effects interpretation lends itself to an analysis by linear methods and to estimation and hypothesis testing by least-squares methods.14 Although a random-effects interpretation is appropriate for many types of agricultural research, as suggested by Eisenhart’s examples of animal breeding experiments and Fisher’s own genetic research, it is principally adapted to the analysis of non-experimental, observational data such as are common in astronomy and the social sciences. Moreover, the easy treatment by linear least-squares methods is not possible where heterogeneity is represented by random effects. Generally speaking, likelihood methods, or related approximate methods, are required for analysis. And these were, before the advent of high-speed electronic computers, very computationally demanding. This is especially true for models in which the variable or variables of interest are also latent.15 Snedecor’s influential texts (1934; 1937-1980) make no mention of random-effects or variance-components models and deal almost exclusively with fixed-effects models in which the only random effect is the overall disturbance. 
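The sum-of-squares partition that the ANOVA table organizes is easy to exhibit directly. The following sketch (Python, my own illustrative example, not drawn from Fisher) verifies, for a one-way layout, that the total sum of squares splits exactly into between-group and within-group components:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-way layout: k groups ("treatments"), n observations per group.
# All numbers are illustrative, not from Fisher.
k, n = 4, 10
effects = rng.normal(0.0, 2.0, size=k)                 # group effects
y = effects[:, None] + rng.normal(0.0, 1.0, (k, n))    # observations

grand_mean = y.mean()
group_means = y.mean(axis=1)

# The partition at the heart of the ANOVA table:
ss_total = ((y - grand_mean) ** 2).sum()
ss_between = n * ((group_means - grand_mean) ** 2).sum()
ss_within = ((y - group_means[:, None]) ** 2).sum()
```

Under the fixed-effects interpretation the between sum of squares measures the treatment contrasts; under the random-effects interpretation it estimates a variance component.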
In 1952, two important books appeared, Rao (1952) and Anderson and Bancroft (1952), which set the agenda for the work of the generation that followed; although both contain some discussion of variance-components models, they emphasized fixed-effects models and the relation of ANOVA to least squares.16 Scheffé (1956), following Eisenhart (1947), gave a balanced discussion of the two types of models, but emphasized (sec. 3, 257-264) the difficulties associated with the analysis of random-effects models. In his definitive 1959 book, Scheffé describes his goal as “a unified presentation of the basic theory of the analysis of variance — including its geometrical aspects.” But in the preface (vii) he writes:

The theory...for fixed-effects models with independent observations of equal variance, I judge to be jelled into a fairly permanent form, but the theory of...other models [random-effects and mixed models], I expect will undergo considerable extension and revision. Perhaps this presentation will help stimulate the needed growth. What I feel most apologetic about is the little I have to offer on the unbalanced cases of the random-effects models and mixed models. These cannot generally be avoided in planning biological experiments, especially in genetics. ...This gap in the theory I have not been able to fill.

  • 17 At least part of the responsibility for the dominance of the fixed-effects model must be laid at th (...)

31Indeed, Scheffé devotes only 40 pages in a 477 page book to random-effects models. In the next section of this paper, I suggest that this emphasis on fixed effects in the more general ANOVA literature has carried over into the econometrics literature on panel data analysis, despite the greater similarity of the problems there addressed to the animal breeding or genetic analyses described by Eisenhart (1947) and Fisher (1918a and 1918b) or to the analysis of astronomical observations described by Airy (1861), all of which require random-effects models.17

  • 18 See also Hemmerle and Hartley (1973).

32Of course, the work of statisticians on random-effects, mixed, and variance-components models did not cease despite the formidable computational problems encountered. The history of the development of the subject is well summarized by Searle (1992, Chapter 2) and Anderson (1978); Rao (1997) gives a more technical account emphasizing developments since 1950. Here are some highlights: Henderson (1953), who worked primarily on animal breeding, developed method-of-moments methods for handling random-effects and mixed models. Hartley and J. N. K. Rao (1967) dealt broadly with the then thorny question of maximum-likelihood methods for variance-components models. They give a completely general ML solution to the n independent random-effects case with unknown overall mean depending linearly on some observed conditioning variables and an overall disturbance independent of the random effects (see also Harville, 1977, and the discussion between him and J. N. K. Rao which follows). Searle and Henderson (1979) carried out a detailed analysis of maximum-likelihood and related methods of estimating variance-components models. Searle, Casella, and McCullough (1992) contains nearly all of this material in revised form in one place or another in a long book. Nerlove (1971b) works out the appropriate transformation of the observed variables required for the diagonalization of the variance-covariance matrix of the 3-component random-effects model with unknown overall mean depending on strictly exogenous variables (but discussion of this and related matters more properly belongs in succeeding sections on the development of panel data models in the econometrics literature). Back-to-back with it, in the same issue of Econometrica, Henderson (1971) complains that he had solved the problem of finding the inverse of the variance-covariance matrix, which is essential to ML estimation, much earlier, in Henderson (1953).
But this is not the same thing as finding the transformation which diagonalizes this matrix; a method for transforming the original observations in order to diagonalize the variance-covariance matrix, it seems to me, is essential for applying maximum-likelihood methods to panels with a large cross-section dimension. This is done for the 3-component model in Nerlove (1971b) and, much more generally, in Searle and Henderson (1979), whose results are also reported in Searle, Casella, and McCullough (1992, 144-146).18

33For more details on the statistical literature on random-effects and mixed models, see the survey by Khuri and Sahai (1985) and the bibliographies of Sahai (1979), containing 2000 citations, and Sahai Khuri and Kapadia (1985), containing an additional 700 citations.

34I turn now to developments in recognizing the importance of individual heterogeneity and path dependence in econometrics. Much of the discussion concerns the appropriateness of accounting for individual heterogeneity by fixed effects or by random effects. The central problem, however, which I now emphasize, is how to account for path dependence in the presence of individual heterogeneity, irrespective of the approach to the latter. The usual approach is to introduce dynamic elements in the statistical analysis, for example, a lagged value of the dependent variable or some other form of distributed lag.

6. Trygve Haavelmo (1911-1999), Jacob Marschak (1898-1977), Clifford Hildreth (1917-1995): The Disturbances in Econometric Relationships

35In his famous and influential monograph, The Probability Approach in Econometrics, Haavelmo (1944) laid the foundations for the formulation of stochastic econometric models and an approach which has dominated our discipline to this day. He wrote:

... we shall find that two individuals, or the same individual in two different time periods, may be confronted with exactly the same set of specified influencing factors [and, hence, they have the same y*, ...], and still the two individuals may have different quantities y, neither of which may be equal to y*. We may try to remove such discrepancies by introducing more “explaining” factors, x. But, usually, we shall soon exhaust the number of factors which could be considered as common to all individuals, and which, at the same time, were not merely of negligible influence upon y. The discrepancies y - y* for each individual may depend upon a great variety of factors, these factors may be different from one individual to another, and they may vary with time for each individual. (Haavelmo, 1944, 50)

And further that:

...the class of populations we are dealing with does not consist of an infinity of different individuals, it consists of an infinity of possible decisions which might be taken with respect to the value of y...we find justification for applying them [stochastic approximations] to economic phenomena also in the fact we usually deal only with—and are interested only in—total or average effects of many individual decisions, which are partly guided by common factors, partly by individual specific factors... (Haavelmo, 1944, 51 and 56)

36 Marschak (1950; 1953) further amplified Haavelmo’s themes in his introductions to Cowles Commission Monographs 10 and 14, observing that:

The numerous causes that determine the error incurred...are not listed separately; instead their joint effect is represented by the probability distribution of the error, a random variable (1950, 18) [, which is] called “disturbance” or “shock,” and can be regarded as the joint effect of numerous separately insignificant variables that we are unable or unwilling to specify but presume to [be] independent of observable exogenous variables. (1953, 12)

  • 19 Marschak and Andrews (1944). For an extended discussion see Nerlove (1965).

37 While this approach, which goes back to Laplace, works well for macro econometric relationships and aggregate data, it will not serve us well for micro econometric data and relationships. Heckman in his Nobel Memorial Prize Lecture (2001, 257-258) states this well and succinctly, calling this the Cowles paradigm rather than the Laplacian one. In an earlier paper dealing with micro data on estimation of production functions, Marschak takes a different and more appropriate approach.19

38 In a remarkable, but unfortunately, until recently, virtually inaccessible, paper, Clifford Hildreth (1950), then at the Cowles Commission at the University of Chicago, set out a three-component model for the latent disturbances in a simultaneous-equations model and considered estimation when these components might be considered random effects or when two of them, period effects and individual effects, might be considered fixed effects and thus parameters to be estimated.20 The case Hildreth (1950) considers is “that in which the investigator has observations on each of a group of economic units for a number of time periods” (1950, 2). Using the old Cowles terminology but a somewhat altered notation, Hildreth’s model is as follows: Let yit be a vector of the current values of the endogenous variables for the ith economic unit in the tth time period and let zit be a vector of values of predetermined variables. μ is an overall constant. Then

(2) B yit + Γ zit = μ + uit.

39Hildreth then goes on to say that some of the variables, both endogenous and predetermined, may not vary across individuals at a point in time or may remain fixed for a given individual over time. He then says some of these may be unobserved or latent: “It may be believed that there are unobserved individual characteristics which cause individuals to act differently and which are persistent over time. There may be unobserved influences that affect individuals in pretty much the same way but change over time” (1950, 3). To take account of these, Hildreth then introduces vectors of constants interpreted as fixed variables, μi* and μ*t , and rewrites (2) as

  • 21 A referee notes that transforming (3) to its reduced form gives an error components, seemingly unre (...)

(3) B yit + Γ zit = μ + μi* + μ*t + uit.21

40But he adds, significantly: “I find it difficult to choose between the alternatives of allowing for these variations peculiar to individuals and variations peculiar to time through fixed parameters [as in (3)] or through random parameters” (1950, 3). In the latter case, Hildreth writes

(4) B yit + Γ zit = μ + ωi + λt + εit,

where ωi, λt and εit are, respectively, “part of the disturbance in all equations relating to the i-th individual,” “part of the disturbance for each individual in the t-th time period” (1950, 4) and all the rest. Hildreth goes on to consider the special case of one equation in one endogenous variable in detail. He then writes down a likelihood function for the parameters in a single equation:

(5) L = (2π)−NT/2 |Ω|−1/2 exp{−(1/2) u′Ω−1u}, with u the vector of stacked disturbances and Ω its covariance matrix,

assuming ωi, λt and εit to be jointly normal with constant variances and uncorrelated. But then he adds: “Maximum likelihood did not work at all well in this case. The estimation equations are difficult to derive and appear to be highly nonlinear in the unknown parameters.” (1950, 11) But derive them he does and thus his story ends, but not his influence. He has completely stated the case for individual heterogeneity in economic relationships among individuals and time periods. Not only that, but Hildreth also presents the case for using fixed effects for computational simplicity.
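Hildreth’s decomposition of the disturbance into individual, period, and idiosyncratic components can be illustrated numerically. The sketch below (my own simulation, with hypothetical variance values; it is not Hildreth’s estimator) draws uit = ωi + λt + εit and recovers the three variance components from simple moments of the row means, the column means, and the panel as a whole:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical panel dimensions and variance components.
N, T = 200, 200
s2_w, s2_l, s2_e = 1.0, 0.5, 2.0

w = rng.normal(0, np.sqrt(s2_w), N)            # individual effects omega_i
l = rng.normal(0, np.sqrt(s2_l), T)            # period effects lambda_t
e = rng.normal(0, np.sqrt(s2_e), (N, T))       # idiosyncratic eps_it
u = w[:, None] + l[None, :] + e

# Moments: Var(row means) ~ s2_w + s2_e/T, Var(column means) ~ s2_l + s2_e/N,
# total Var ~ s2_w + s2_l + s2_e.  Solving the three equations:
v_rows, v_cols, v_tot = u.mean(axis=1).var(), u.mean(axis=0).var(), u.var()
est_e = (v_tot - v_rows - v_cols) / (1 - 1 / T - 1 / N)
est_w = v_rows - est_e / T
est_l = v_cols - est_e / N
```

The point of the exercise is that, under the random-effects interpretation, the components are identified from second moments alone, with no parameters per individual or per period.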

7. Irving Hoch (1926- ) and Yair Mundlak (1927- ): Fixed-Effects Reign Supreme

  • 22 For extended "variations on the theme" of Marschak and Andrews, see Nerlove (1965). The Marschak-An (...)

41In 1944, Marschak and Andrews published their famous article, “Random Simultaneous Equations and the Theory of Production.”22 The basic problem raised by Marschak and Andrews is that in a model which involves profit-maximizing or cost-minimizing firms, the choices of input levels and, in the case of profit maximization, output levels as well, are endogenous. Imperfections in profit maximization, differing knowledge of technology, and unobserved variations in other variables such as weather, fixed factors of production, and heterogeneity of inputs, all give rise to disturbances in the relation among inputs and outputs and factor and product prices. Even if the latter are exogenously determined, in a cross section of firms, the relation between output(s) and inputs is what is termed a confluent relationship, that is, the relationship reflects the unobserved differences among firms and variations in the prices they face.

  • 23 His dissertation title was "Estimation of Agricultural Resource Productivities Combining Time Serie (...)
  • 24 In his more detailed summary, Hoch (1962) cites both Hildreth (1950) as his source for the idea of (...)

42In the mid-1950s, Irving Hoch, then a graduate student at the University of Chicago, tackled this problem. His Ph.D. dissertation, completed in 1957, written under the direction of a committee chaired by D. Gale Johnson, which also had earlier included Hildreth, dealt with the problem of how to combine cross section and time series data for a panel of firms in order to resolve or partially resolve the identification and estimation issues posed by Marschak and Andrews (1944). The bulk of his research was published in two papers in Econometrica (Hoch 1958, 1962), but a preliminary report of his inquiry was presented to the meeting of the Econometric Society in Montreal in September, 1954, and reported in Hoch (1955).23 In the research reported in 1954, Hoch used a panel of 63 Minnesota farms over the six-year period 1946-1951 to estimate a Cobb-Douglas production function relating the dollar value of output to the value of inputs in four categories, all in logs; he introduced fixed effects linearly for both year and farm, in what was then, and is now, called Analysis of Covariance in the general statistical literature, as suggested in Hildreth (1950).24 Hoch (1962) interpreted his results in terms of left-out factors, particularly the firm effects in terms of differential managerial ability, or technical efficiency as he called it. Mundlak (1961) builds on this idea in his famous article, “Empirical Production Functions Free of Management Bias.”

43Suppose there is an unobserved factor of production called management. Mundlak (1961, 44) writes: “...we shall assume that whatever management is, it does not change considerably over time; and for short periods, say a few years, it can be assumed to remain constant.” Mundlak then asks what the usual OLS regression of the log of output on the logs of the input levels means, interprets the result in terms of the “intrafirm” function as distinguished from the “interfirm” function (Bronfenbrenner, 1944), and argues that along the lines suggested by Hoch (1955), a panel of firms for which the management factor can be assumed to be approximately fixed over time for each firm can be used to obtain unbiased estimates of the intrafirm production function. The technique he suggests is covariance analysis and he cites Scheffé (1959) as his statistical authority. Mundlak treats the firm fixed effects as proportional to a latent variable measuring “management” and is thus able to measure “management bias” by comparing the fixed firm-effect regression with the OLS regression without firm effects. He suggests introducing a year effect as well. He implements these suggestions in a panel study of 66 family farms in Israel for the period 1954-1958. Several more papers followed along similar lines (Mundlak, 1963; Mundlak and Kaddar, 1964; Mundlak and Hoch, 1965) and, in Mundlak (1978a, 1978b), he mounted a spirited defense of the fixed-effects approach.

44Mundlak (1978a) bases his defense of fixed-effects models on two counts: First, “Without loss of generality, it can be assumed from the outset that the effects are random and view the [fixed-effects] inference as a conditional inference, that is, conditional on the effects that are in the sample.” And, second, “The whole literature which has been based on the imaginary difference between the two [is] based on an incorrect specification which ignores the correlation between the effects and the explanatory variables.” (1978a, 70)

  • 25 Consider a simple supply and demand model. Condition on price. Can one interpret the regression of (...)
  • 26 The absurdity of the contention that possible correlation between some of the observed explanatory (...)

45The issue of conditional versus unconditional inference is not as trivial as Mundlak appears to suggest; it lies at the heart of the debate between Fisher and his critics described by Aldrich (1999). Aldrich further points out that the question lies behind the simultaneous-equations debate stemming from Haavelmo (1944) and much discussion that preceded his work. One can condition on what one pleases, but then one cannot generally interpret the result as structural.25 This is the same point made by Eisenhart (1947). What of the correlation between the effects and the observed explanatory variables? As Mundlak and others have been at pains to point out, fixed effects are equivalent to considering only deviations from individual means and ignoring the cross sectional variation of the means themselves. Using this information is equivalent to attributing some of the explanatory effect of an observed variable to the relation between it and an effect. A good thing in my view, not a bad thing as Mundlak would have us believe. Another way of saying the same thing is to say that fixed-effect methods throw away important and useful information about the relation between the explanatory and the explained variables in a panel.26 This point is widely misunderstood among econometric practitioners and the reason cited by Mundlak is frequently used to justify the choice of a fixed-effects model, for computational simplicity, notwithstanding its inappropriateness for getting at structural relations in a nonexperimental context.
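The point that fixed-effects estimation uses only deviations from individual means, discarding the cross-sectional variation of the means themselves, can be made concrete in a small simulation (my own, with arbitrary parameter values). When the effects are correlated with the regressor, the within estimator recovers the slope from the within variation alone, while the regression on the individual means absorbs the correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, beta = 500, 5, 1.0          # arbitrary illustrative values

mu = rng.normal(0, 1, N)                         # individual effects
x = mu[:, None] + rng.normal(0, 1, (N, T))       # regressor correlated with effects
y = beta * x + mu[:, None] + rng.normal(0, 1, (N, T))

# Within (fixed-effects) estimator: OLS on deviations from individual means.
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd * yd).sum() / (xd ** 2).sum()

# Between estimator: OLS on the individual means alone -- the cross-sectional
# information the within estimator throws away.
xb = x.mean(axis=1) - x.mean()
yb = y.mean(axis=1) - y.mean()
b_between = (xb * yb).sum() / (xb ** 2).sum()
```

The gap between the two estimates is exactly the information about the relation between the effects and the regressor that the within estimator discards.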

8. Pietro Balestra (1935-2005) and Marc Nerlove (1933- ): Random-Effects Seem More Natural

46It is often said that “Ignorance is bliss.” But one should also remember Santayana: “Those who cannot remember the past are condemned to repeat it.” Because I was a participant in the development of what used to be called Balestra-Nerlove models, I will tell the story of Balestra and Nerlove (1966) in some detail. It is less a story of bliss than of repetition of the past.

  • 27 W. Brian Arthur has written extensively on path dependence in the economy (Arthur, 1989, 1994). It (...)

47Balestra arrived at Stanford in 1959, having spent the previous two years at the University of Kansas acquiring a Master’s degree in Economics and teaching French and Italian to keep body and soul together. I arrived in 1960. We did not immediately meet. In 1963, Balestra completed a second Master’s degree (in Statistics) and went to work for the Stanford Research Institute in Menlo Park as an economist and econometrician. In the summer of 1963 he came in with a proposed thesis topic using the data on natural gas consumption by individual U.S. states over a period of years, which he had collected for an SRI project. His idea was to treat the demand for gas as a derived demand from the demand for residential space heating and so, in the manner of Fisher and Kaysen (1962), to include the stock of gas-using durable equipment in the equation to be estimated. Such a derived demand model is presented in our joint paper (Balestra and Nerlove, 1966, 585-589) and results in a dynamic demand model, which is of the familiar geometrically distributed lag form including a lagged value of gas consumption, when the unobserved stock is eliminated by substitution from the appropriate stock-flow relationship. The importance of this derivation lay in the particular interpretation it implied of the coefficient of lagged gas consumption in the relationship to be estimated: it could be interpreted in terms of the annual depreciation rate for the durable gas appliance used together with gas to yield space heat (that is, 1 minus that rate, in fact). Introduction of either the stock of gas-using equipment or the derived distributed lag introduces path dependence. One could argue that all meaningful economic relationships, at least at the individual level, are essentially dynamic, that is, exhibit path or state dependence in this sense.27 More below on the difficulty of distinguishing state dependence from individual heterogeneity.
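The substitution that produces the geometric-lag form can be sketched as follows (a stylized reconstruction under simple assumptions, not the exact equations of Balestra and Nerlove, 1966, 585-589):

```latex
% S_t: stock of gas-using appliances; \delta: annual depreciation rate;
% f(x_t): new demand for appliances, driven by prices, income, population.
S_t = (1-\delta)\,S_{t-1} + f(x_t)
% Gas consumption proportional to the stock in use:
G_t = \theta S_t
% Eliminating the unobserved stock S_t by substitution:
G_t = \theta f(x_t) + (1-\delta)\,G_{t-1}
```

The coefficient of lagged consumption is thus one minus the annual depreciation rate, which is the economic interpretation referred to in the text.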

  • 28 Throughout I use “bias” to refer to inconsistency of an estimate, with no possibility of confusion (...)

48The data collected by Balestra for 36 states over the period 1950-1962 comprised gas consumption normalized by degree days, average price per btu (British thermal unit) deflated by the Consumer Price Index, population, and per capita personal income in 1961 dollars. While these data do not refer to individual economic agents, the presumption is that they reflect the behavior of such agents. Further, because of the presumption of structural change about the middle of the period, attention was restricted to the data for the 36 states over the 6-year period, 1957-1962. Within a couple of weeks, Balestra, always a fast worker, reappeared with a set of OLS regressions, involving various restrictions, using the pooled sample and also one regression including fixed effects for the individual states. The multiple regression correlations were all uniformly high (~0.99) but, in view of the interpretation afforded by the theoretical model of the coefficient of lagged gas consumption, the estimates of this parameter were bizarre: all greater than one, implying negative depreciation of gas space-heating equipment, for the regressions based on the pooled sample, and an implausibly low value (0.68) for the regression which included state-specific fixed effects, implying extremely rapid depreciation (32 percent per annum). We now understand why such results were obtained, but in 1963 about all we understood was the inadequacy (bias) of the regressions based on the pooled sample.28 Moreover, there is the more general problem of distinguishing path dependence from individual heterogeneity which is discussed below.

  • 29 In Balestra and Nerlove (1966), we cite Kuh (1959) as our authority. Kuh, in turn, cites nobody, bu (...)

49I recall rather pedantically explaining that the disturbances represented numerous, individually insignificant variables affecting the gas consumption in a particular state in a particular year, some of which were peculiar to the individual state, i.e., individual state-specific, and didn’t change much or at all over time. I must have sounded much like Sir George Biddell Airy 102 years before, explaining the day-specific effects in measurements of the angular diameter of Jupiter taken in the course of a single night! Being quite unaware of Airy, Fisher, or Eisenhart, and even of Scheffé, we proceeded to formulate a simple two-component random-effects model, although we do mention the possibility of a separate year effect.29 We also had some doubts about the independence of state-specific time-invariant effects, one from another, because of the arbitrariness of the geographical boundaries with respect to the economic behavior analyzed. Our formulation then leads to the familiar block-diagonal residual variance-covariance matrix, dependent on two unknown parameters:

(6) uit = μi + εit, where

(i) E(μi)= E(εit)= 0 , all i and t,

(ii) E(μiεjt)= 0 , all i, j and t,

(iii) E(μiμj)= σ2μ for i=j, = 0 otherwise,

(iv) E(εitεjs)= σ2ε for t=s, i=j, = 0 otherwise.

Both μi and εit are assumed to be uncorrelated with xit for all i and t. Stacking the u’s

(7) Ω = E(uu′) = σ2(IN ⊗ A), where A is a T×T matrix

A = (1−ρ)IT + ρ ιT ι′T,

that is, with ones on the diagonal and ρ = σ2μ/(σ2μ + σ2ε) in every off-diagonal position, where σ2 = σ2μ + σ2ε and ιT is a T-vector of ones.

  • 30 After we had prepared a draft of a report on the investigation, William Madow of SRI called our att (...)

ρ was called the intraclass correlation by Fisher (1918a, 1925). In our case, T= 6 and N= 36, so the matrix Ω was 216×216. Inverting such a large matrix, even were ρ known, would have been a problem for us at that time. Wallace and Hussein (1969) and Henderson (1971) were yet to come. However, the matrix Ω has a rather simple structure despite its large size. Balestra was then, as he has ever been, a wiz with matrices; it took him about a week to find the characteristic roots of Ω and the orthogonal transformation which would reduce Ω to diagonal form.30 The roots of Ω are

ξ = σ2(1−ρ), with multiplicity N(T−1), and η = σ2(1−ρ+Tρ), with multiplicity N.
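The two roots can be checked numerically. The sketch below (Python, illustrative parameter values, with T = 6 as in the gas study) constructs Ω explicitly and compares its eigenvalues with σ2(1−ρ), of multiplicity N(T−1), and σ2(1−ρ+Tρ), of multiplicity N:

```python
import numpy as np

# Illustrative values (T = 6 as in the gas study; sigma^2 and rho arbitrary).
N, T, s2, rho = 4, 6, 2.0, 0.3

A = (1 - rho) * np.eye(T) + rho * np.ones((T, T))
Omega = s2 * np.kron(np.eye(N), A)      # the NT x NT block-diagonal matrix

roots = np.sort(np.linalg.eigvalsh(Omega))
xi = s2 * (1 - rho)                     # multiplicity N(T-1)
eta = s2 * (1 - rho + T * rho)          # multiplicity N
```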

50The orthogonalizing transformation effectively weights the deviations of each observation from its individual specific mean and the individual specific means themselves by the reciprocals of the square roots of the characteristic roots. Thus, if zit is a typical observation (either on the dependent variable or on one of the explanatory variables), the transformed observations are

z*it = (zit − z̄i)/√ξ + z̄i/√η, where z̄i is the mean over time of the observations on the ith individual.

51If T is very large, the importance of the individual-specific means in the transformed observations becomes negligible and we are left with transformed observations which are simply the deviations from these means. The analysis is then equivalent to one which includes individual-specific fixed effects. But if T is small, cross section variation plays a greater role.
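That the transformation whitens the disturbances, and that the weight on the individual-specific means becomes negligible as T grows, can be verified by simulation (a sketch with arbitrary variance values, not the Balestra-Nerlove data): applying the transformation to disturbances drawn from the two-component model should produce a covariance matrix close to the identity.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 2000, 6
s2_mu, s2_eps = 0.6, 1.4                # arbitrary variance components
s2 = s2_mu + s2_eps
rho = s2_mu / s2
xi = s2 * (1 - rho)                     # roots of Omega
eta = s2 * (1 - rho + T * rho)

# Draw disturbances u_it = mu_i + eps_it.
u = rng.normal(0, np.sqrt(s2_mu), (N, 1)) + rng.normal(0, np.sqrt(s2_eps), (N, T))
ubar = u.mean(axis=1, keepdims=True)

# Weight within deviations by 1/sqrt(xi) and individual means by 1/sqrt(eta).
u_star = (u - ubar) / np.sqrt(xi) + ubar / np.sqrt(eta)

C = (u_star.T @ u_star) / N             # sample T x T covariance of u*
```

Since η grows linearly with T while ξ does not, the weight 1/√η on the means shrinks relative to 1/√ξ, which is the formal content of the remark that for large T the analysis approaches the fixed-effects one.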

  • 31 I was then, as now, a great fan of likelihood methods. Although I had acquired and almost certainly (...)
  • 32 See also Anderson and Hsiao (1981, 1982), Blundell and Smith (1991), below for more details, and th (...)
  • 33 Hildreth (1949) would have told us this, but we were unaware of this paper at the time. Consider

52Of course, ρ is unknown. Although a great deal of Balestra’s 1965 dissertation (published as Balestra, 1967) and of the paper we ultimately published together (Balestra and Nerlove, 1966) deals with various alternative methods of estimation, I recall that in late 1963 we headed straight for maximum likelihood as the preferred method for estimating ρ simultaneously with the other parameters. It was only because this method seemed to fail that we turned to other alternatives later.31 At the time, however, we didn’t realize, as Bhargava and Sargan (1983) were to show us twenty years later, that the presence of a lagged value of the dependent variable as one of the explanatory variables, i.e., the autoregressive nature of the relationship to be estimated from the panel data, makes all the difference in the formulation of the likelihood function.32 Again there is the problem of separating path dependence from individual heterogeneity. Reasoning from the non-dynamic case, we blundered, treating the lagged dependent variable as predetermined, that is, as fixed just like one of the x’s.33 In terms of ξ and η, the characteristic roots of Ω, this leads to the likelihood function:

(8) log L = −(NT/2) log 2π − (N(T−1)/2) log ξ − (N/2) log η − (1/2) Σi Σt (y*it − βx*it − γy*i,t−1)2,

where y*, x* and y*-1 are the transformed variables and the overall constant has been eliminated by expressing all observations as deviations from their overall means.

53Fresh from the study of Koopmans and Hood (1953, esp. 156-158), I suggested concentrating log L in (8) with respect to σ2, β and γ, resulting in

(9) log L*(ρ) = const. − (NT/2) log σ̂2(ρ) − (N(T−1)/2) log(1−ρ) − (N/2) log(1−ρ+Tρ),

where σ̂2(ρ) is the mean squared residual from the GLS regression of y* on x* and y*−1 for the given value of ρ.

  • 34 I was later able to reproduce such boundary solutions in the Monte Carlo results reported in Nerlov (...)

54It was, we thought, a simple enough matter to find the maximum of log L* with respect to ρ. Within a week, Balestra was back with the disconcerting news that log L* didn’t have a maximum within the open interval (0, 1); indeed the maximum of the concentrated log likelihood function occurred for ρ= 0, which implied that the OLS estimates of β and γ from the pooled sample were best, but this we knew led to estimates of depreciation which were negative and therefore unacceptable.34

  • 35 The formula was later adapted in Nerlove (1967) to calculate the first-stage estimate of ρ from the (...)

55What to do? Being youthful, and therefore somewhat unwise from my present perspective, we rejected the likelihood approach. Instead, we turned to a two-stage procedure. Balestra suggested using instrumental variables in the manner of 2SLS (two-stage least-squares) to obtain consistent estimates of γ and then a method of calculating an estimate of ρ from the residuals from the first-stage instrumental variables regression in such a way that the estimate would have to lie in the interval (0, 1).35 The second stage then consisted of GLS applied to the original data (equivalent to OLS applied to the data transformed using the first-stage estimate of ρ). The instruments chosen were simply the lagged values of the exogenous explanatory variables (the x’s). This resulted in a positive annual depreciation rate of about 4.5%, which we both regarded as plausible. A Technical Report of the Stanford Institute for Mathematical Studies in the Social Sciences appeared in late 1964, Balestra’s dissertation was completed in 1965, a joint paper was published in 1966, and Balestra’s dissertation was published as a book in 1967. I think the important thing about this investigation was the way in which our economic theory of the determination of the demand for natural gas informed and guided the estimation of the relationship. Many investigators would have stopped with an R2 of 0.99, but it was the odd result of negative depreciation which prevented us from doing so. Nonetheless, there were lots of loose ends that remained to be tied up.
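The two-stage procedure can be rendered in a few lines (a stylized modern sketch, not the original Balestra-Nerlove computations: the simulated data, the instrument choice, and the particular moment formula for ρ, truncated at zero so the estimate lies in [0, 1), are all my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 36, 20                     # 36 "states", arbitrary T
beta, gamma = 1.0, 0.6            # hypothetical true coefficients
s2_mu, s2_eps = 0.5, 1.0

# Simulate a dynamic panel: y_it = beta*x_it + gamma*y_i,t-1 + mu_i + eps_it.
x = rng.normal(0, 1, (N, T + 1))
mu = rng.normal(0, np.sqrt(s2_mu), N)
y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    y[:, t] = beta * x[:, t] + gamma * y[:, t - 1] + mu + rng.normal(0, np.sqrt(s2_eps), N)

Y = y[:, 1:].ravel()
Ylag = y[:, :-1].ravel()
X = x[:, 1:].ravel()
Xlag = x[:, :-1].ravel()

# Stage 1: instrument the lagged dependent variable with the lagged x.
Z = np.column_stack([X, Xlag])    # instruments
W = np.column_stack([X, Ylag])    # regressors
coef_iv = np.linalg.solve(Z.T @ W, Z.T @ Y)

# Estimate rho from the stage-1 residuals (variance of individual means
# versus total variance), truncated so that 0 <= rho_hat < 1.
res = (Y - W @ coef_iv).reshape(N, T)
s2_mu_hat = max(res.mean(axis=1).var() - res.var() / T, 0.0)
rho_hat = s2_mu_hat / res.var()

# Stage 2: GLS via the orthogonalizing transformation (the common factor
# sigma^2 in the roots cancels out of GLS, so only rho matters here).
xi, eta = 1 - rho_hat, 1 - rho_hat + T * rho_hat

def transform(z):
    z = z.reshape(N, T)
    zbar = z.mean(axis=1, keepdims=True)
    return ((z - zbar) / np.sqrt(xi) + zbar / np.sqrt(eta)).ravel()

Ws = np.column_stack([transform(X), transform(Ylag)])
coef_gls = np.linalg.solve(Ws.T @ Ws, Ws.T @ transform(Y))
```

With a fairly long panel, as here, the second-stage estimates land close to the true coefficients; for short panels the dynamic biases discussed in the text reappear.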

56Tying up loose ends was one of my own principal preoccupations in the decade following, for example Nerlove (1967, 1971a, 1971b). Notable contributions were also made by G. S. Maddala (1971a; 1971b; 1975, first published in 1994) and Maddala and Mount (1973). Methodological research by Mundlak (1978a), Amemiya (1971), C. R. Rao (1972), and Fuller and Battese (1974) emphasized alternatives to maximum likelihood. A number of important empirical applications also appeared in the decade: Beginning with Griliches and Mason (1972), Griliches (1974, 1977) and Chamberlain and Griliches (1975) undertook a series of investigations on the relation among income, ability, and schooling. Mention should also be made of two Harvard Ph. D. dissertations done under Zvi Griliches’s supervision: Mazodier (1971, partially published as Mazodier, 1972), and Chamberlain (1975). Due in part to the availability of panel data from the University of Michigan Panel Study of Income Dynamics, several papers appeared which used panel data to study the determination of earnings profiles: Hause (1977), Lillard and Willis (1978), and Lillard and Weiss (1979). These are basically dynamic panel models and the problems encountered underscore the difficulties apparent in retrospect in the Balestra-Nerlove (1966) investigation. But at that time these difficulties were still poorly understood. The stimulus to methodological development in this period afforded by interesting and readily available data sets cannot be overemphasized.

57In August 1977, Pascal Mazodier organized the first conference on panel data econometrics (Colloque International du CNRS, 1978). He had written his thesis (1971) on panel data econometrics at Harvard under Griliches’s supervision and was at that time Director of the Unité de Recherche of the Institut National de la Statistique et des Études Économiques (INSEE), of which Edmond Malinvaud was then Director General. The influence of this conference on the subsequent course of research was enormous. One of the most significant papers presented at the conference, as I said in my introduction to the Conference Proceedings, was the paper presented by James Heckman (1978), “Simple Statistical Models for Discrete Panel Data Developed and Applied to Test the Hypothesis of True State Dependence Against the Hypothesis of Spurious State Dependence.”

9. James Heckman (1944- ) and the Hidden Hand of the Past: Attempts to Separate State Dependence from Individual Heterogeneity

58Everyone has a past, and differences among individuals are not merely the result of their genetic endowments, but of their past experiences. In one of the great books of the 20th century, Risk, Uncertainty and Profit, Frank H. Knight (1921) observed:

The fundamental fact about society as a going concern is that it is made up of individuals who are born and die and give place to others; and the fundamental fact about modern civilization is that it is dependent upon the utilization of three great accumulating funds of inheritance from the past, material goods and appliances, knowledge and skill, and morale. Besides the torch of life itself, the material wealth of the world, a technological system of vast and increasing intricacy and the habituations which fit men for social life must in some manner be carried forward to new individuals born devoid of all these things as older individuals pass out. (1921, 375)

59While, in some respects, this is the old nature versus nurture issue, in the present context it raises the question of path dependence versus idiosyncratic shocks in economic relationships. The invention of distributed lags by Irving Fisher was designed to represent path dependence and has given rise to a large literature. In the case of the Balestra-Nerlove model, a lagged value of the dependent variable, or geometric distributed lag, is used to model the dynamics of demand for natural gas. However, as noted above, Balestra and Nerlove had considerable trouble estimating the lag coefficient, whether by maximum likelihood using a random-effects model of individual heterogeneity or by using fixed effects. In Monte Carlo experiments, Nerlove (1971a) showed that the lag coefficient is biased when estimated with fixed effects and, as evidenced by the presence of boundary solutions in the maximization of the likelihood, in the random-effects case as well. Trognon (1978) showed he could reproduce the Balestra-Nerlove results for boundary solutions analytically, and Maddala (1971a, 1971b) demonstrated analytically the existence of multiple maxima of the likelihood function, including boundary solutions. S. Nickell (1981) showed that estimates of the lag coefficient are biased downwards in the case of fixed-effects models. All of these results reflect, in essence, the lack of identifiability of the hidden hand of the past and individual idiosyncratic effects in micro-econometric data.
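The downward bias of the fixed-effects (within) estimator of the lag coefficient, derived analytically by Nickell (1981), is easy to reproduce by simulation. The following sketch is in the spirit of the Monte Carlo experiments of Nerlove (1971a), not a replication of them; the panel dimensions, true lag coefficient, and replication count are hypothetical. With a short panel, the within estimator is noticeably biased below the truth.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, N, T, reps = 0.5, 100, 5, 200    # short T makes the Nickell bias visible

est = []
for _ in range(reps):
    # Dynamic panel with individual effects: y_it = gamma * y_i,t-1 + mu_i + e_it
    mu = rng.normal(size=N)
    y = np.zeros((N, T + 1))
    for t in range(1, T + 1):
        y[:, t] = gamma * y[:, t - 1] + mu + rng.normal(size=N)

    ylag, ycur = y[:, :-1], y[:, 1:]
    # Within (fixed-effects) estimator: demean within each panel, then pooled OLS
    yl = ylag - ylag.mean(axis=1, keepdims=True)
    yc = ycur - ycur.mean(axis=1, keepdims=True)
    est.append((yl * yc).sum() / (yl * yl).sum())

print(np.mean(est))   # clearly below the true value of 0.5
```

The bias arises because demeaning introduces a correlation between the transformed lagged dependent variable and the transformed disturbance, a correlation that vanishes only as T grows.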

60In his paper for the 1977 Paris conference on Panel Data Econometrics, Heckman (1978, 228) wrote:

This paper considers simple answers to the following question. From an observed series of discrete events from individual histories, such as spells of unemployment experienced by workers or the labor force participation histories of married women, is it possible to explain the frequently noted empirical regularity that individuals who experience an event in the past are more likely to experience the event in the future? There are two distinct explanations for this regularity. One explanation is that individuals who experience this event are altered by their experience. A second explanation is that individuals differ in their propensity to experience the event. If individual differences are stable over time, individuals who experience an event in the past are more likely to experience the event in the future, even though the actual experience of the event has not modified this behavior.

  • 36 Blundell (2001) has a good discussion of much of this work up to 2000.

61In his answer Heckman uses two or more periods of panel data to see whether, when exogenous variables are held constant, experience in previous periods actually does affect the likelihood of the event in the current period. He makes the assumption that events prior to the sample period have no effect. Of course, it may take more than one experience in the past to affect current behavior discernibly. Heckman continued to work on the problem of identifying state or path dependence and individual heterogeneity in the years following the Paris Conference and continues to this day, for example, in papers on duration models: Heckman and Borjas (1980), Heckman (1981a, 1981b, 1982, 1984, 1991), Heckman and Singer (1984, 1985), and Flinn and Heckman (1983).36 The simple idea of introducing a lagged dependent variable in a regression with unobserved heterogeneity will not resolve the identification problem. Much more explicit modeling of path dependence and choice is required, as Heckman and co-authors show.
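The distinction Heckman draws can be illustrated with a simulation in which there is no true state dependence at all: each individual’s outcomes are independent draws given a stable but heterogeneous propensity. The propensity distribution and sample sizes below are hypothetical. Spurious persistence nonetheless appears in the data: those who experienced the event in the first period are markedly more likely to experience it in the second, even though the experience itself changed nothing.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 5000, 2
# Individual propensities differ but are stable over time; today's outcome
# does NOT depend on yesterday's outcome given the propensity.
p_i = rng.beta(2, 2, size=N)                 # heterogeneous, time-invariant propensities
y = rng.random((N, T)) < p_i[:, None]        # independent draws given p_i

past, future = y[:, 0], y[:, 1]
p_given_past   = future[past].mean()         # P(event today | event yesterday)
p_given_nopast = future[~past].mean()        # P(event today | no event yesterday)
print(p_given_past, p_given_nopast)          # the first clearly exceeds the second
```

The gap arises purely through selection: having experienced the event in the past signals a high p_i, and a high p_i makes future events more likely. Distinguishing this mechanism from true state dependence is exactly the identification problem Heckman poses.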

62In his 1991 paper in the American Economic Review, Heckman concludes (1991, 79):

The ability to distinguish between heterogeneity and duration dependence in single spell duration models rests critically on maintaining explicit assumptions about the way unobservables and observables interact. A general nonseparable model is nonparametrically underidentified. … Economically extraneous statistical assumptions drive the answer. … Viewed as the prototype for identification in general nonergodic models, these results are not encouraging.

63After 1991, Heckman increasingly turned to evaluation of economic programs and policies in which the question of path dependence versus individual heterogeneity plays a crucial role. In recent years, Heckman has focused on the related problem of inferring causal relationships from micro economic data. These issues have continued to excite considerable interest among those working in applied and methodological issues in panel data econometrics and with other forms of micro economic data.

Concluding Remarks

64This paper has traced the history of how the disturbances in econometric relationships have been viewed by econometricians. I began with the problem of combining multiple astronomical observations to yield more accurate estimates of a single magnitude; it was the solution of this problem that yielded both the method of maximum likelihood and that of ordinary least squares. In the middle of the nineteenth century the astronomer George Biddell Airy broke new ground by distinguishing between different sources of error; he developed the first variance-components, or random-effects, model of individual heterogeneity. Towards the end of the nineteenth century and in the first quarter of the twentieth, Galton and R. A. Fisher used such models of heterogeneity in the analysis of anthropometric and biometric data. Fisher also used regression methods and invented the ANOVA table in his analysis of heterogeneous agronomic data. These methods later came to dominate the analysis of economic and social data through models with fixed effects or, in econometrics, “dummy variables” or “differences-in-differences.” However, the residuals or disturbances in econometric relationships were still viewed in terms of the Laplacian paradigm of the superposition of large numbers of small independent shocks, as normally, but perhaps not independently, distributed. The importance of individual heterogeneity in the analysis of micro social and economic data was, however, increasingly recognized. Such heterogeneity was taken into account with fixed effects by Hoch and Mundlak and with random effects by Wallace and Hussain and by Nerlove and Balestra. The debate as to which of the two is more appropriate is explored in this paper.

65Because the past histories of individuals, or of aggregates of individuals, are important in shaping their behavior, the effects of policies depend on those past histories as well as on other individual differences. Whether we model these differences with fixed or random effects is not the issue; the problem is to separate the effects of past histories, which we try to capture by estimating dynamic models from panel or longitudinal data, from other sources of individual heterogeneity.

  • 37 I, myself, have worked on the closely related problem of the initial conditions in dynamic models t (...)

66I am ending my story with Heckman’s 1991 AER paper, although panel data econometrics has remained an area of active research on the estimation of dynamic models and on avoiding the potential biases of both fixed-effects and random-effects (error-components) formulations.37 Much of this research is well summarized in Hsiao (2003, Chapter 4) and Baltagi (2013, Chapter 8). Unfortunately, in my view, in the more than 35 years since the Paris Conference of 1977 no solution has been found to the general problem of distinguishing between “the hidden hand of the past” and individual heterogeneity. As Heckman says, “Economically extraneous statistical assumptions drive the answer.” (1991, 79) The conclusion of this paper is that there is no economically meaningful way to separate the effects of individual heterogeneity from what has happened to the individual in the past.

This article is partially adapted from my essay “The History of Panel Data Econometrics, 1861-1997,” in Marc Nerlove (2002a, 4-81), Essays in Panel Data Econometrics, Cambridge University Press, and revised with additional material. I dedicate this paper to my co-author for several earlier papers on panel data econometrics and former student, the late Pietro Balestra. I am indebted to the editor, two referees, and to Olav Bjerkholt and Anke S. Meyer for helpful comments on earlier drafts.



Airy, George Biddell. 1861. On the Algebraical and Numerical Theory of Errors of Observations and the Combination of Observations. Cambridge and London : Macmillan and Co.

Aldrich, John C. 1999. Fisher and Fixed X Regression. Unpublished discussion paper, University of Southampton, November 24.

Amemiya, Takeshi. 1971. The Estimation of the Variances in a Variance-Components Model. International Economic Review, 12(1) : 1-13.

Anderson, R. E. 1978. Studies on the Estimation of Variance Components. Ph. D. dissertation, Cornell University, Ithaca. Summarized in “On the History of Variance Component Estimation,” in L. Dale Van Vleck and Shayle R. Searle (eds.), 1979, Variance Components and Animal Breeding. Ithaca, NY : Cornell University, 19-57.

Anderson, Richard L. and T. A. Bancroft. 1952. Statistical Theory in Research. New York, NY : McGraw-Hill.

Anderson, Theodore W. and Cheng Hsiao. 1981. Estimation of Dynamic Models with Error Components. Journal of the American Statistical Association, 76(375) : 598-606.

Anderson, Theodore W. and Cheng Hsiao. 1982. Formulation and Estimation of Dynamic Models Using Panel Data. Journal of Econometrics, 18(1) : 47-82.

Arthur, W. Brian. 1989. Competing Technologies, Increasing Returns, and Lock-in by Historical Events. Economic Journal, 99(394) : 116‑131.

Arthur, W. Brian. 1994. Increasing Returns and Path Dependence in the Economy. Ann Arbor, MI : University of Michigan Press.

Avery, Robert B. 1977. Error Components and Seemingly Unrelated Regressions. Econometrica, 45(1) : 199-208.

Balestra, Pietro and Marc Nerlove. 1966. Pooling Cross Section and Time Series Data in the Estimation of a Dynamic Model : The Demand for Natural Gas. Econometrica, 34(3) : 585-612.

Balestra, Pietro. 1967. The Demand for Natural Gas in the United States : A Dynamic Approach for the Residential and Commercial Market. Amsterdam : North-Holland Publishing.

Baltagi, Badi H. 1980. On Seemingly Unrelated Regressions with Error Components. Econometrica, 48(6) : 1547-1551.

Baltagi, Badi H. 2013. Econometric Analysis of Panel Data, 5th edition. New York, NY : Wiley.

Bayes, Thomas. [1764] 1970. An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London for 1763, 53 : 370-414. Reprinted with an introduction by G. A. Barnard, in E. S. Pearson and M. G. Kendall (eds.), Studies in the History of Statistics and Probability, London : Chas. Griffin, 131-153.

Bernoulli, Daniel. [1777] 1970. The Most Probable Choice between Several Discrepant Observations and the Formation there from of the Most Likely Induction. In Latin, in Acta Academiae Scientiarum Imperialis Petropolitanae. Reprinted as “Observations on the foregoing Dissertation of Bernoulli,” with an introduction by M. G. Kendall and an extended commentary by Leonard Euler, in Egon S. Pearson and M. G. Kendall (eds.), Studies in the History of Statistics and Probability, London : Chas. Griffin, 155-172.

Bernoulli, Jacob. 1713. Ars Conjectandi. Basel : Thurnisiorum.

Bhargava, Alok and J. D. Sargan. 1983. Estimating Dynamic Random Effects Models from Panel Data Covering Short Time Periods. Econometrica, 51(6) : 1635-1659.

Biørn, Erik and Jayalakshmi Krishnakumar. 2008. Measurement Errors and Simultaneity. In Lászlo Mátyás and Patrick Sevestre (eds.), op. cit., 323-367.

Blundell, Richard and Richard J. Smith. 1991. Conditions initiales et estimation efficace dans les modèles dynamiques sur données de panel. Annales d’Économie et de Statistique, 20-21 : 109-124.

Blundell, Richard. 2001. James Heckman’s Contributions to Economics. Scandinavian Journal of Economics, 103(2) : 191-203.

Boumahdi, Rachid and Alban Thomas. 2008. Endogenous Regressors and Correlated Effects. In Lászlo Mátyás and Patrick Sevestre (eds.), op. cit., 89-112.

Box, Joan Fisher. 1978. R. A. Fisher : The Life of a Scientist. New York, NY : Wiley.

Bronfenbrenner, Martin. 1944. Production Functions : Cobb-Douglas, Interfirm, Intrafirm. Econometrica, 12(1) : 35-44.

Chamberlain, Gary and Zvi Griliches. 1975. Unobservables with a Variance-Components Structure : Ability, Schooling, and the Economic Success of Brothers. International Economic Review, 16(2) : 422-449.

Chamberlain, Gary. 1975. Unobservables in Econometric Models. Unpublished Ph. D. dissertation, Department of Economics, Harvard University.

Chamberlain, Gary. 1980. Analysis of Covariance with Qualitative Data. Review of Economic Studies, 47(1) : 225-238.

Chamberlain, Gary. 1984. Panel Data. In Zvi Griliches and Michael Intriligator (eds.), Handbook of Econometrics, vol. 2. Amsterdam : North-Holland, 1247-1318.

Chauvenet, William. [1863] 1960. A Manual of Theoretical and Practical Astronomy : Embracing the General Problems of Spherical Astronomy, the Special Applications to Nautical Astronomy, and the Theory and Use of Fixed and Portable Astronomical Instruments, with an Appendix on the Method of Least Squares, 1st edition, Philadelphia, PA : J. B. Lippincott and Co., 1863. The fifth edition, 1889, reprinted by Dover Publications, New York, NY.

Colloque International du CNRS. 1978. L’économétrie des données individuelles temporelles. Annales de l’INSEE, 30-31.

Crépon, Bruno and Jacques Mairesse. 1996. The Chamberlain Approach. In Lászlo Mátyás and Patrick Sevestre (eds.), op. cit., 323-391.

Daniels, Henry E. 1939. The Estimation of Components of Variance. Supplement to the Journal of the Royal Statistical Society, 6(2) : 186-197.

David, Paul A. 2001. Path Dependence, its Critics and the Quest for “Historical Economics”. In Pierre Garrouste and Stavros Ioannides (eds.), Evolution and Path Dependence in Economic Ideas : Past and Present. Cheltenham : Edward Elgar, 15-40.

de Finetti, Bruno. 1930. Problemi Determinati e Indeterminati nel Calcolo delle Probabilità. Rendiconti della R. Accademia Nazionale dei Lincei, vol. 12, Serie 6, Fasc. 9.

de Finetti, Bruno. [1970] 1990. Teoria delle Probabilità. Torino : Giulio Einaudi, 1970. Translated as Theory of Probability, New York, NY : Wiley.

de Moivre, Abraham. 1738. A Doctrine of Chances. London : A. Miller.

de Morgan, Augustus. 1838. An Essay on Probabilities. London : Longman.

Edgeworth, Francis Y. 1885. On Methods of Ascertaining Variations in the Rate of Births, Deaths and Marriages. Journal of the Royal Statistical Society, 48 : 628-649.

Edwards, Anthony W. F. 1997. What Did Fisher Mean by “Inverse Probability” in 1912-1922 ? Statistical Science, 12(3) : 177-184.

Eisenhart, Churchill. 1947. The Assumptions Underlying the Analysis of Variance. Biometrics, 3(1) : 1-21.

Fisher, Franklin M. and Carl Kaysen. 1962. The Demand for Electricity in the United States. Amsterdam : North-Holland Publishing.

Fisher, Ronald A. 1918a. The Correlation between Relatives on the Supposition of Mendelian Inheritance. Transactions of the Royal Society of Edinburgh, 52 : 399-433.

Fisher, Ronald A. 1918b. The Causes of Human Variability. Eugenics Review, 10(4) : 213-220.

Fisher, Ronald A. [1922] 1992. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London, Series A, 222 : 309-368. Reprinted with an introduction by Seymour Geisser in Samuel Kotz and Norman L. Johnson, Breakthroughs in Statistics, vol. 1, New York, NY : Springer-Verlag, 1-44.

Fisher, Ronald A. 1925. Theory of Statistical Estimation. Proceedings of the Cambridge Philosophical Society, 22 : 700-725.

Fisher, Ronald A. 1932. Inverse Probability and the Use of Likelihood. Proceedings of the Cambridge Philosophical Society, 28 : 257-261.

Flinn, Chris and James Heckman. 1983. The Likelihood Function for the Multistate-Multiepisode Model. In R. Bassmann and G. Rhodes (eds.), Models for the Analysis of Labor Force Dynamics, Advances in Econometrics, vol. 2. Greenwich, CT : JAI Press, 225-231.

Fuller, Wayne A. and George E. Battese. 1974. Estimation of Linear Models with Crossed-Error Structure. Journal of Econometrics, 2(1) : 67-78.

Galton, Francis. 1889. Natural Inheritance. London : Macmillan.

Gauss, Carl Friedrich. [1809] 1963. Theoria motus corporum celestium. Hamburg : Perthes und Besser, 1809. Translation by Charles H. Davis in Theory of Motion of Heavenly Bodies, New York, NY : Dover.

Gauss, Carl Friedrich. 1823. Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Pars Prior. Commentationes Societatis Regiae Scientiarum Gottingensis Recentiores 5.

Griliches, Zvi and William M. Mason. 1972. Education, Income, and Ability. Journal of Political Economy, 80(3), Part 2 : S74-Sl03.

Griliches, Zvi. 1974. Errors in Variables and Other Unobservables. Econometrica, 42(6) : 971-998.

Griliches, Zvi. 1977. Estimating the Returns to Schooling : Some Econometric Problems. Econometrica, 45(1) : l-22.

Haavelmo, Trygve. 1943. The Statistical Implications of a System of Simultaneous Equations. Econometrica, 11(1) : 1-12.

Haavelmo, Trygve. 1944. The Probability Approach in Econometrics. Econometrica, 12(Supplement) : iii-vi+1-115.

Hald, Anders. 1998. A History of Mathematical Statistics from 1750 to 1930. New York, NY : Wiley.

Halperin, Max. 1951. Normal Regression Theory in the Presence of Intra-Class Correlation. Annals of Mathematical Statistics, 22(4) : 573-580.

Hartley, Hermann Otto and Jon N. K. Rao. 1967. Maximum Likelihood Estimation for the Mixed Analysis of Variance Model. Biometrika, 54(1/2) : 93-108.

Harville, David A. 1977. Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems. With a comment by Jon N. K. Rao and a rejoinder by Harville. Journal of the American Statistical Association, 72(358) : 320-340.

Hause, John C. 1977. The Covariance Structure of Earnings and the On-the-Job Training Hypothesis. Annals of Economic and Social Measurement, 6(4) : 335-365.

Heckman, James J. and George J. Borjas. 1980. Does Unemployment Cause Future Unemployment ? Definitions, Questions and Answers from a Continuous Time Model of Heterogeneity and State Dependence. Economica, 47(187) : 247-283.

Heckman, James J. and Burton Singer. 1984. Econometric Duration Analysis. Journal of Econometrics, 24(1-2) : 63-132.

Heckman, James J. and Burton Singer (eds.). 1985. Longitudinal Analysis of Labor Market Data. Cambridge : Cambridge University Press.

Heckman, James J. 1978. Simple Statistical Models for Discrete Panel Data Developed and Applied to Test the Hypothesis of True State Dependence against the Hypothesis of Spurious State Dependence. In Colloque International du CNRS, op. cit., 227-269.

Heckman, James J. 1981a. Statistical Models for Discrete Panel Data. In Charles F. Manski and Daniel L. McFadden (eds.), Structural Analysis of Discrete Data with Econometric Applications. Cambridge, MA : MIT Press, 114-178.

Heckman, James J. 1981b. The Incidental Parameters Problem and the Problem of Initial Conditions in Estimating a Discrete Time - Discrete Data Stochastic Process. In Charles F. Manski and Daniel L. McFadden (eds.), Structural Analysis of Discrete Data with Econometric Applications. Cambridge, MA : MIT Press, 179-195.

Heckman, James J. 1982. The Identification Problem in Econometric Models for Duration Data. In Werner Hildenbrand (ed.), Advances in Econometrics, Proceedings of the Fourth World Congress of the Econometric Society. Cambridge : Cambridge University Press.

Heckman, James J. 1984. The Chi-square Goodness of Fit Statistic for Models with Parameters Estimated from Microdata. Econometrica, 52(6) : 1543-1547.

Heckman, James J. 1991. Identifying the Hand of the Past : Distinguishing State Dependence from Heterogeneity. American Economic Review, 81(2) : 75-79.

Heckman, James J. 2001. Micro Data, Heterogeneity, and the Evaluation of Public Policy : Nobel Lecture. Journal of Political Economy, 109(4) : 673-748.

Hemmerle, W. J. and Hermann Otto Hartley. 1973. Computing Maximum Likelihood Estimates for the Mixed AOV Model using the W-Transformation. Technometrics, 15(4) : 819-831.

Henderson, Charles R. 1953. Estimation of Variance and Covariance Components. Biometrics, 9(2) : 226-252.

Henderson, Charles R. 1971. Comment on “The Use of Error Components Models in Combining Cross Section with Time Series Data”. Econometrica, 39(2) : 397-401.

Hildreth, Clifford. 1949. Preliminary Considerations Regarding Time Series and/or Cross Section Studies. Cowles Commission Discussion Paper, No. 333, July 18.

Hildreth, Clifford. 1950. Combining Cross Section Data and Time Series. Cowles Commission Discussion Paper, No 347, May 15.

Hoch, Irving. [1954] 1955. Estimation of Production Function Parameters and Testing for Efficiency. Paper presented to the Econometric Society, Montreal Meeting, September 10, 1954, report. Econometrica, 23 : 325-326.

Hoch, Irving. 1957. Estimation of Agricultural Resource Productivities Combining Time Series and Cross Section Data. Unpublished Ph.D. dissertation, University of Chicago, March.

Hoch, Irving. 1958. Simultaneous Equations Bias in the Context of the Cobb-Douglas production Function. Econometrica, 26(4) : 566-578.

Hoch, Irving. 1962. Estimation of Production Function Parameters Combining Time-Series and Cross-Section Data. Econometrica, 30(1) : 34-53.

Hsiao, Cheng. 2003. Analysis of Panel Data, 2nd edition. Cambridge : Cambridge University Press.

Khuri, Andre I. and Hardeo Sahai. 1985. Variance Components Analysis : A Selective Literature Survey. International Statistical Review, 53(3) : 279-300.

Knight, Frank H. 1921. Risk, Uncertainty and Profit. Boston, MA : Houghton Mifflin Co.

Koopmans, Tjalling C. and William C. Hood. 1953. The Estimation of Simultaneous Linear Economic Relationships. In William C. Hood and Tjalling C. Koopmans (eds.), Studies in Econometric Method, Chapter 6. New York, NY : Wiley, 112-199.

Kuh, Edwin. 1959. The Validity of Cross-Sectionally Estimated Behavior Equations in Time Series Applications. Econometrica, 27(2) : 197-214.

Laplace, Pierre Simon. [1774] 1986. Mémoire sur la probabilité des causes par les évènements. Mémoires de l’Académie Royale des Sciences Presentés par Divers Savants, 6: 621-656. Translated in Stephen M. Stigler. 1986. Laplace’s 1774 Memoir on Inverse Probability, Statistical Science, 1(3): 359-378.

Laplace, Pierre Simon. 1812. Théorie analytique des probabilités. Paris: Courcier.

Legendre, Adrien Marie. 1805. Nouvelles méthodes pour la détermination des orbites des comètes. Paris : Courcier.

Lehmann, Erich L. 1993. The Fisher, Neyman-Pearson Theories of Testing Hypotheses : One Theory or Two ? Journal of the American Statistical Association, 88(424) : 1242-1249.

Lillard, Lee A. and Robert J. Willis. 1978. Dynamic Aspects of Earnings Mobility. Econometrica, 46(5) : 985-1012.

Lillard, Lee. A. and Yoram Weiss. 1979. Components of Variance in Panel Data Earnings Data : American Scientists, 1960-1970. Econometrica, 47(2) : 437-454.

Lindley, Dennis V. and Melvin R. Novick. 1981. The Role of Exchangeability in Inference. Annals of Statistics, 9(1) : 45-58.

Maddala, Gangadharrao Soundalyarao. [1975] 1994. Some Problems Arising in Pooling Cross-Section and Time Series Data. Discussion paper, University of Rochester, 1975. First published in G. S. Maddala, Econometric Methods and Applications, Volume 1. Aldershot : Edward Elgar, 223-245.

Maddala, Gangadharrao Soundalyarao. 1971a. The Use of Variance Components Models in Pooling Cross Section and Time Series Data. Econometrica, 39(2) : 341-358.

Maddala, Gangadharrao Soundalyarao. 1971b. The Likelihood Approach to Pooling Cross Section and Time Series Data. Econometrica, 39(6) : 939-953.

Maddala, Gangadharrao Soundalyarao. 1987. Recent Developments in the Econometrics of Panel Data Analysis. Transportation Research, Part A : General, 21(4–5) : 303-326.

Maddala, Gangadharrao Soundalyarao and Timothy D. Mount. 1973. A Comparative Study of Alternative Estimators for the Variance Components Model Used in Econometric Applications. Journal of the American Statistical Association, 68(342) : 324-328.

Marschak, Jacob and William H. Andrews. 1944. Random Simultaneous Equations and the Theory of Production. Econometrica, 12(3/4) : 143-205.

Marschak, Jacob. 1950. Structural Inference in Economics : An Introduction. In Tjalling C. Koopmans (ed.), Statistical Inference in Dynamic Economic Models. New York, NY : Wiley, 1-50.

Marschak, Jacob. 1953. Economic Measurements for Policy and Prediction. In William C. Hood and Tjalling C. Koopmans (eds.), Studies in Econometric Method. New Haven, CT : Yale University Press, 1-26.

Mátyás, Lászlo and Patrick Sevestre (eds). 2008. Econometrics of Panel Data : A Handbook of the Theory with Applications. Third Revised Edition. Berlin and Heidelberg : Springer.

Mazodier, Pascal A. 1971. The Econometrics of Error Components Models. Unpublished Ph. D. dissertation, Department of Economics, Harvard University.

Mazodier, Pascal A. 1972. L’Estimation des modèles à erreurs composées. Annales de l’INSEE, 7 : 43-72.

Moran, Patrick A. P. and Cedric A. B. Smith. 1966. Commentary on R. A. Fisher’s Paper on The Correlation between Relatives on the Supposition of Mendelian Inheritance. London : Published for the Galton Laboratory, University College London, by the Cambridge University Press.

Mundlak, Yair and Gershon Kaddar. 1964. An Economic Analysis of Established Family Farms in Israel, 1953-1958. The Falk Project for Economic Research in Israel, Jerusalem.

Mundlak, Yair and Irving Hoch. 1965. Consequences of Alternative Specifications in Estimation of Cobb-Douglas Production Functions. Econometrica, 33(4) : 824-828.

Mundlak, Yair. 1961. Empirical Production Functions Free of Management Bias. Journal of Farm Economics, 43(1) : 44-56.

Mundlak, Yair. 1963. Estimation of Production and Behavioral Functions from a Combination of Cross-Section and Time-Series Data. In Carl F. Christ et al., Measurement in Economics : Studies in Mathematical Economics and Econometrics in Memory of Yehuda Grunfeld, 138-166. Stanford, CA : Stanford University Press.

Mundlak, Yair. 1978a. On the Pooling of Time Series and Cross Section Data. Econometrica, 46(1) : 69-85.

Mundlak, Yair. 1978b. Models with Variable Coefficients : Integration and Extension. In Colloque International du CNRS, op. cit., 483-509.

Nerlove, Marc and S. James Press. 1978. Review of Discrete Multivariate Analysis : Theory and Practice, by Yvonne M. M. Bishop, Stephen E. Fienberg, and Paul W. Holland, Cambridge, MA : MIT Press, 1976. The Bulletin of the American Mathematical Society, 84(3) : 470-480.

Nerlove, Marc and S. James Press. 1986. Multivariate Log-Linear Probability Models in Econometrics. In Roberto S. Mariano (ed.), Advances in Statistical Analysis and Statistical Computing : Theory and Applications, 1. Greenwich, CT : JAI Press, 117-171.

Nerlove, Marc. 1965. Estimation and Identification of Cobb-Douglas Production Functions. Chicago, IL : Rand McNally.

Nerlove, Marc. 1967. Experimental Evidence on the Estimation of Dynamic Economic Relations from a Time-Series of Cross Sections. Economic Studies Quarterly, 18(3) : 42-74.

Nerlove, Marc. 1968. Distributed Lags. International Encyclopedia of the Social Sciences, II. New York, NY : The Macmillan Co, 214-217.

Nerlove, Marc. 1971a. Further Evidence on the Estimation of Dynamic Economic Relations from a Time Series of Cross-Sections. Econometrica, 39(2) : 359-382.

Nerlove, Marc. 1971b. A Note on Error Components Models. Econometrica, 39(2) : 383-396.

Nerlove, Marc. 1999a. Properties of Alternative Estimators of Dynamic Panel Models : An Empirical Analysis of Cross-Country Data for the Study of Economic Growth. In Cheng Hsiao, Kajal Lahiri, Lung-Fei Lee, and M. Hashem Pesaran (eds.), Analysis of Panels and Limited Dependent Variable Models. Cambridge : Cambridge University Press, 136-170.

Nerlove, Marc. 1999b. Likelihood Inference for Dynamic Panel Models. Annales d’économie et de statistique, 55/56 : 369-410.

Nerlove, Marc. 2002a. Essays in Panel Data Econometrics. Cambridge : Cambridge University Press.

Nerlove, Marc. 2002b. The History of Panel Data Econometrics, 1861-1997. In Marc Nerlove, Essays in Panel Data Econometrics. Cambridge : Cambridge University Press, 4-81.

Nickell, Stephen. 1981. Biases in Dynamic Models with Fixed Effects. Econometrica, 49(6) : 1417-1426.

Plackett, Robin L. 1972. Studies in the History of Probability and Statistics. XXIX : The Discovery of the Method of Least Squares. Biometrika, 59(2) : 239-251.

Rao, C. Radhakrishna. 1946. On the Linear Combination of Observations and the General Theory of Least Squares. Sankhyã : The Indian Journal of Statistics, 7(3) : 237-256.

Rao, C. Radhakrishna. 1952. Advanced Statistical Methods in Biometric Research. New York, NY : Wiley.

Rao, C. Radhakrishna. 1970. Estimation of Heteroscedastic Variances in Linear Models. Journal of the American Statistical Association, 65(329) : 161-172.

Rao, C. Radhakrishna. 1971a. Estimation of Variance and Covariance Components -- MINQUE Theory. Journal of Multivariate Analysis, 1(3) : 257-275.

Rao, C. Radhakrishna. 1971b. Minimum Variance Quadratic Unbiased Estimation of Variance Components. Journal of Multivariate Analysis, 1(4) : 445-456.

Rao, C. Radhakrishna. 1972. Estimation of Heteroscedastic Variances in Linear Models. Journal of the American Statistical Association, 67(337) : 112-115.

Rao, C. Radhakrishna. 1979. MINQUE Theory and its Relation to ML and MML Estimation of Variance Components. Sankhyã : The Indian Journal of Statistics, 41, Series B : 138-153.

Rao, Poduri S. R. S. 1997. Variance Components Estimation : Mixed Models, Methodologies and Applications. London : Chapman & Hall.

Sahai, Hardeo, Andre I. Khuri, and C. H. Kapadia. 1985. A Second Bibliography on Variance Components. Communications in Statistical Theory and Methods, 14(1) : 63-115.

Sahai, Hardeo. 1979. A Bibliography on Variance Components. International Statistical Review, 47(2) : 177-222.

Scheffé, Henry. 1956. Alternative Models for the Analysis of Variance. Annals of Mathematical Statistics, 27(2) : 251-271.

Scheffé, Henry. 1959. The Analysis of Variance. New York, NY : Wiley.

Searle, Shayle R. and Harold V. Henderson. 1979. Dispersion Matrices for Variance Components Models. Journal of the American Statistical Association, 74(366) : 465-470.

Searle, Shayle R., George Casella, and Charles E. McCulloch. 1992. Variance Components. New York, NY : Wiley.

Simpson, Thomas. 1755. A Letter to the Right Honorable George Earl of Macclesfield, President of the Royal Society, on the Advantage of Taking the Mean of a Number of Observations, in Practical Astronomy. Philosophical Transactions of the Royal Society of London, 49 : 82-93.

Snedecor, George W. 1934. Calculation and Interpretation of Analysis of Variance and Covariance. Ames, IA : Collegiate Press.

Snedecor, George W. 1937-1980. Statistical Methods, 1st through 7th editions. Ames, IA : Iowa State College Press.

Sobel, Dava. 1995. Longitude. New York, NY : Penguin Books.

Stigler, Stephen M. 1986. The History of Statistics : The Measurement of Uncertainty before 1900. Cambridge, MA : Harvard University Press.

Stigler, Stephen M. 1999. Statistics on the Table : The History of Statistical Concepts and Methods. Cambridge, MA : Harvard University Press.

Thiele, Thorvald N. [1903] 1931. Theory of Observations. London : Layton. Reprinted in Annals of Mathematical Statistics, 2 : 165-307.

Trognon, Alain. 1978. Miscellaneous Asymptotic Properties of Ordinary Least Squares and Maximum Likelihood Methods in Dynamic Error Components Models. Annales de l’INSEE, 30-31 : 631-657.

Wallace, Thomas D., and Ashiq Hussain. 1969. The Use of Error Components Models in Combining Cross Section with Time Series Data. Econometrica, 37(1) : 55-72.

Wilks, Samuel S. [1943] 1962. Mathematical Statistics. Princeton, NJ : Princeton University Press, 1943. Second, greatly augmented edition, New York, NY : Wiley, 1962.

Zellner, Arnold, Jan Kmenta and Jacques H. Dreze. 1966. Specification and Estimation of Cobb-Douglas Production Function Models. Econometrica, 34(4) : 784-795.

Notes


1 Lehmann (1993) gives a nice summary of the ideas of Fisher, Neyman, and Egon S. Pearson about testing hypotheses, i.e. inference about the DGP. The key papers in the development of the Cowles paradigm are Haavelmo (1943, 1944) and Marschak (1950, 1953), discussed at some length below.

2 J. Bernoulli (1713) attempted to give a formal basis for the commonsense notion that the greater the accumulation of evidence about an unknown proportion of cases ("fertile" to total as he described it), the closer we are to knowledge of the "true" proportion, but that there is no natural bound in general which permits total elimination of all residual uncertainty. In his "limit" theorem, Bernoulli clearly shifts the emphasis from the observations themselves to inference about the underlying stochastic process which generates them.

3 The term inverse probability is not used by Laplace (1774) but appears, according to Edwards (1997, 178), for the first time in de Morgan (1838, Chapter 3).

4 De Moivre, of course, had it as early as 1733 in an unpublished memorandum later published as part of de Moivre (1738). See Hald (1998, 17-25).

5 It is interesting to note that Marschak (1953) uses exactly the same justification for supposing that the disturbances in structural equations are normally distributed.

6 A reference to, and a brief discussion of, Airy’s monograph are to be found in Scheffé (1956), who credits Churchill Eisenhart for the reference.

7 William Chauvenet (1820-1870) was professor of mathematics at the U.S. Naval Academy in Annapolis from its founding in 1845 until his departure for Washington University in St. Louis in 1859, where he ultimately became Chancellor of the University. His book, A Manual of Theoretical and Practical Astronomy, went through many editions; the fifth and last, to which I have had access, was published in 1889.

8 Hald (1998, 675) mentions two additional precursors of Fisher: Edgeworth (1885) and Thiele (1903).

9 See the discussion of Hoch (1954) and Mundlak (1961) below.

10 If there are Q variables there are, in general, 2^Q - 1 main and interaction effects. If all of them are present, the model is called saturated. If each variable is categorical, as in the example, representing an interaction does not require a number of parameters equal to the product of the numbers of categories of the variables involved, but a considerably smaller number, since the ANOVA restrictions imply that the unconstrained parameter values sum to zero over any index. In the case discussed above, for example, only two parameters are required for each main effect, but four for each bivariate interaction, and eight for the single trivariate interaction.
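The counting in this note can be checked with a short sketch (the function name and the three-variable, three-category example are mine; the sum-to-zero restrictions are those described in the note):

```python
from itertools import combinations

def anova_effect_counts(categories):
    """Count the main and interaction effects in a saturated model, and the
    free parameters each requires under the ANOVA sum-to-zero restrictions."""
    q = len(categories)
    effects = {}
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            # Under the restrictions, an effect over a subset of variables
            # needs the product of (categories - 1), not of categories.
            params = 1
            for j in subset:
                params *= categories[j] - 1
            effects[subset] = params
    return effects

# Three variables with three categories each, as in the note's example.
effects = anova_effect_counts([3, 3, 3])
print(len(effects))           # 2**3 - 1 = 7 effects in the saturated model
print(sum(effects.values()))  # 26 = 3*3*3 - 1 free parameters in total
```

Each main effect gets two parameters, each bivariate interaction four, and the trivariate interaction eight, matching the counts in the note.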

11 This was not always so. Fisher’s battles with the experimental establishment to introduce randomization into experimental design, which had heretofore been systematic, are described in detail in Box (1978, 140-166).

12 See also Moran and Smith (1966). Fisher (1918b) was first submitted to Biometrika, where Karl Pearson rejected it as editor. Relations between the two men were never the same after that!

13 Scheffé (1956), Anderson (1978), and Searle, Casella and McCulloch (1992, Chapter 2), give more details on the development of analysis of variance in the years following Fisher (1925).

14 See especially Rao (1946). Rao (1971a, b) works out details of the MINQUE alternative to maximum likelihood.

15 As, for example, in the ANOVA interpretation of log-linear probability models for the analysis of categorical data (Nerlove and Press, 1978, 1986).

16 Anderson and Bancroft (1952, 313-377) have five chapters (22-25) on variance-components models and a thorough discussion of method-of-moments estimation of such models. C. R. Rao’s book (1952) is restricted to linear models and doesn’t mention variance-components models, although he later published a number of papers dealing with minimum variance quadratic unbiased estimation (MINQUE) of variance-components models (1970, 1971a-b, 1972, 1979). See P. S. R. S. Rao (1997, Chapters 7-8) for a discussion of this and related methods.

17 At least part of the responsibility for the dominance of the fixed-effects model must be laid at the door of Fisher himself. See Aldrich (1999). Joan Fisher Box reports in her biography of her father (1978, 117) that in a 1922 paper, Fisher showed that "...the significance of the coefficients of regression formulae -- linear or curvilinear, simple or multiple -- could be treated exactly by Student’s test. Though in May Gosset had seemed convinced by Fisher’s argument to this effect, when he read this paper again after visiting Fisher, he became bothered about it; even as he was setting about the calculation of the new t-table, he was putting the problem to Fisher with a pertinacity that refused to be quieted, until he could be convinced that this use of his table was, in fact, correct. Gosset’s difficulty was one that has troubled other statisticians in other contexts. ...In making the regression of y on x and estimating the significance of deviations about the regression line Fisher had proved that the distribution of the ratio of a regression coefficient b to its standard error followed the t-distribution. It was not obvious to Gosset that the resulting test was legitimate, because the sampling distribution of the x’s themselves was not taken into account. Fisher, however, argued that it was only the distribution of y’s relative to the fixed sample of values of x actually obtained, not to the population of the x’s, that had relevance."

18 See also Hemmerle and Hartley (1973).

19 Marschak and Andrews (1944). For an extended discussion see Nerlove (1965).

20 In the 1950 paper Hildreth refers to a still earlier paper, Hildreth (1949), also posted on the Cowles web site. In it Hildreth discusses a disaggregated system of behavioral relations and points out that what might be considered predetermined in a time-series context might not be in a cross section. It is possible that this observation led him to the error-components formulation of the disturbances in the 1950 paper. I am indebted to Peter Phillips for retrieving the 1949 paper from the Cowles archives for me. And we are now all indebted to him for arranging for both Hildreth papers to be posted on the Cowles web site.

21 A referee notes that transforming (3) to its reduced form gives an error components, seemingly unrelated regression system of the form discussed by Avery (1977) and by Baltagi (1980).

22 For extended "variations on the theme" of Marschak and Andrews, see Nerlove (1965). The Marschak-Andrews paper gave rise to a considerable literature on the estimation of production functions in the context of models involving the kind of unobserved variations among firms detailed in the text. See, for example, Mundlak and Hoch (1965) and Zellner, Kmenta and Dreze (1966).

23 His dissertation title was "Estimation of Agricultural Resource Productivities Combining Time Series and Cross Section Data," University of Chicago, March, 1957.

24 In his more detailed summary, Hoch (1962) cites both Hildreth (1950) as his source for the idea of using the Analysis of Covariance and Wilks (1943) for its implementation. In Wilks (1943, 195-199), the Analysis of Covariance is explained as standard fixed-effects ANOVA in which the overall mean is a linear function of some continuously measured variables which are uncorrelated with the disturbance. It is interesting to note that variance components or random effects are nowhere mentioned in Wilks (1943), but in the greatly augmented second edition, Wilks (1962, 308-313) devotes five pages to the subject and cites Eisenhart (1947). Presumably on the basis of his earlier derivation of the likelihood equations for the random-effects model, Hildreth advised Hoch to adopt a fixed-effects framework, which by then dominated the statistical literature as well.

25 Consider a simple supply and demand model. Condition on price. Can one interpret the regression of quantity on price as the supply curve or as the demand curve? What is the appropriate interpretation?

26 The absurdity of the contention that possible correlation between some of the observed explanatory variables and the individual-specific component of the disturbance is a ground for using fixed effects should be clear from the following example: Consider a panel of households with data on consumption and income, from which we are trying to estimate a consumption function. Income varies across households and over time. The variation across households is related to the ability of the main earner and other household-specific factors which vary little over time, that is to say, it reflects mainly differences in permanent income. Such permanent differences in income are widely believed to be the source of most differences in consumption, both cross-sectionally and over time, whereas variations of income over time are likely to be mostly transitory and unrelated to consumption in most categories. Yet fixed-effects regressions are equivalent to using only this transitory variation and discarding the information on the consumption-income relationship contained in the cross-section variation among the household means. The problems of using fixed-effects models when there are time-invariant explanatory variables which ought to be included are addressed by Boumahdi and Thomas (2008) and by Biørn and Krishnakumar (2008), inter alia. Mundlak (1978a, 71-72) also makes the seemingly contrary argument that if the individual-specific effects are linear functions of the individual-specific means of the observed explanatory variables plus an uncorrelated error, GLS estimation with error components collapses to fixed effects in the non-dynamic case. This is the basis for Chamberlain’s "Π-matrix" approach (Chamberlain, 1980, 1984); see also Crépon and Mairesse (1996). As Maddala (1987, 305) points out, however, this is no longer true if only some of the individual-specific means enter. It is also not true if the relationship to be estimated is dynamic or if the error includes a period-specific effect.
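The point about discarded cross-section variation can be illustrated with a stylized simulation (all parameter values here are invented for illustration: consumption responds 0.9 to permanent income and only 0.2 to transitory income):

```python
import random

def consumption_panel_estimates(n=1000, t=4, seed=1):
    """Illustrative sketch: the within (fixed-effects) estimator uses only
    the over-time (transitory) variation in income, while the between
    estimator uses the cross-section variation among household means."""
    rng = random.Random(seed)
    wn = wd = bn = bd = 0.0   # within and between sums of cross-products
    for _ in range(n):
        perm = rng.gauss(0, 1)                  # permanent income component
        xs, cs = [], []
        for _ in range(t):
            trans = rng.gauss(0, 1)             # transitory income component
            xs.append(perm + trans)
            cs.append(0.9 * perm + 0.2 * trans + rng.gauss(0, 0.1))
        xbar = sum(xs) / t
        cbar = sum(cs) / t
        bn += xbar * cbar                       # between: household means only
        bd += xbar * xbar
        for x, c in zip(xs, cs):
            wn += (x - xbar) * (c - cbar)       # within: deviations from means
            wd += (x - xbar) ** 2
    return wn / wd, bn / bd

within, between = consumption_panel_estimates()
print(round(within, 2), round(between, 2))
# within is near the transitory response 0.2; between is much larger
```

The within estimate recovers only the small transitory response, discarding the large permanent-income relationship that dominates the cross-section.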

27 W. Brian Arthur has written extensively on path dependence in the economy (Arthur, 1989, 1994). It is most pervasive, however, in individual behavior, and it lies behind all of the work, starting with Irving Fisher, on distributed lags. See Nerlove (1968). In this paper, I ignore the difference between path and state dependence and use the two terms interchangeably. But see David (2001).

28 Throughout I use “bias” to refer to inconsistency of an estimate, with no possibility of confusion in the present context. As discussed below, I succeeded in reproducing these results of bias in Monte Carlo studies published in 1967 and 1971 (Nerlove, 1967, 1971a). It was not until much later, however, that the bias in the coefficient of the lagged dependent variable in the fixed-effects regression was understood (Nickell, 1981). The comparative properties of alternative estimators in dynamic panel models are summarized and illustrated in Nerlove (1999a). Only much later, after rereading the paper Heckman presented at the 1977 Paris Conference on Panel Data Econometrics (Heckman, 1978), did I understand why both these biases occurred. See below.
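The downward bias of the within estimator in short panels, later explained by Nickell (1981), is easy to reproduce in a stylized Monte Carlo sketch (the design and parameter values are mine, not those of Nerlove (1967, 1971a)):

```python
import random

def simulate_within_gamma(n=2000, t=5, gamma=0.5, seed=0):
    """Monte Carlo sketch of the (Nickell) bias of the fixed-effects
    (within) estimator of gamma in  y_it = gamma*y_i,t-1 + mu_i + eps_it."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        mu = rng.gauss(0, 1)
        y = 0.0
        for _ in range(50):                     # burn in toward stationarity
            y = gamma * y + mu + rng.gauss(0, 1)
        ys = [y]
        for _ in range(t):
            ys.append(gamma * ys[-1] + mu + rng.gauss(0, 1))
        lag, cur = ys[:-1], ys[1:]
        lbar, cbar = sum(lag) / t, sum(cur) / t
        # The within transformation sweeps out mu_i but leaves the demeaned
        # lagged dependent variable correlated with the demeaned disturbance.
        for l, c in zip(lag, cur):
            num += (l - lbar) * (c - cbar)
            den += (l - lbar) ** 2
    return num / den

gamma_hat = simulate_within_gamma()
print(round(gamma_hat, 2))   # well below the true value 0.5 when T is small
```

With T = 5 the within estimate falls far short of the true coefficient, the bias the text describes; it shrinks as T grows.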

29 In Balestra and Nerlove (1966), we cite Kuh (1959) as our authority. Kuh, in turn, cites nobody but, interestingly enough, mentions the possibility that the effect of capital stock (equivalent to a lagged dependent variable) in an investment equation may be biased downwards by the inclusion of fixed firm effects.

30 After we had prepared a draft of a report on the investigation, William Madow of SRI called our attention to Halperin (1951). Halperin uses the same orthogonal transformation and considers only the case in which no lagged value of the dependent variable is included as explanatory. Even in this case, however, Halperin remarks (1951, 574) that: "The maximum likelihood equations for the estimation of the parameters...are of such a formidable character that an explicit solution does not appear possible." Later, Hartley and Rao (1967) were independently to use the same transformation.

31 I was then, as now, a great fan of likelihood methods. Although I had acquired and almost certainly had read Hildreth (1950) ten years earlier when I was research assistant to Tjalling Koopmans and Jacob Marschak at the Cowles Commission then at the University of Chicago, the part I internalized was his characterization of the appropriate formulation of the disturbances in a panel context and not, fortunately as it turned out, his warning that the method of maximum likelihood was too difficult to be of use and that fixed effects would have to suffice.

32 See also Anderson and Hsiao (1981, 1982), Blundell and Smith (1991), below for more details, and the extended discussion in Nerlove (1999b).

33 Hildreth (1949) would have told us this, but we were unaware of this paper at the time. Consider

(*) yit = α + βxit + γyit-1 + μi + εit, i = 1,...,N, t = 1,...,T.

As remarked by a referee: this equation is equivalent to one in which the dependent variable is measured with error in an otherwise linear model whose disturbance is now simply εit. Unfortunately, this does not resolve Hildreth’s problem of how to obtain maximum likelihood estimates of all the parameters with the computational power then at his disposal. Taking deviations (of all variables, both independent and dependent) from their overall means in (*) eliminates the constant α. The usual assumptions are made about the properties of the μi and the εit. Assume that, possibly after some differencing, both the yit and the xit are stationary. In this case, the initial observations are determined by

(**) yi0 = α/(1-γ) + β Σs=0..∞ γ^s xi,-s + μi/(1-γ) + Σs=0..∞ γ^s εi,-s.

The joint distribution of yiT, ..., yi1, yi0 depends on the distribution of μi, εit, and xit. If yi0 is literally taken as fixed, which is to deny the plausible assumption that it is generated by the same process as generates the yit that are observed, the conditional likelihood function for the model (*) with uit = μi + εit ~ N(0, σ²Ω) is derived in the usual way from the product of the densities of yit conditional on xit and yit-1. Our formulation in 1963 treated yi0 exactly as if it were fixed, just like one of the x’s. This is again the problem of fixed-X regression (Aldrich, 1999), also clearly not legitimate in this case.

34 I was later able to reproduce such boundary solutions in the Monte Carlo results reported in Nerlove (1971a), but not in those for the case of a pure autoregression reported in Nerlove (1967). This was all later explained by Trognon at the first Paris conference in 1977 (Trognon, 1978).

35 The formula was later adapted in Nerlove (1967) to calculate the first-stage estimate of ρ from the fixed-effects regression. Unfortunately, the method of ensuring that the first-stage estimate is non-negative has not been followed in many subsequent expositions and computer programs for panel data analysis, with the result that the random-effects model is frequently rejected in favor of the fixed-effects model. When a lagged value of the dependent variable is included among the explanatory variables, the most common method of estimating ρ from a first-stage regression is almost bound to give a negative estimate (Nerlove, 1999a, 146-147).
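The truncation at zero that this note describes can be sketched as follows (a stylized variance-components first stage, not the exact formula of Nerlove (1967): with groups of size t, the between mean square estimates t·σμ² + σε² and the within mean square estimates σε²):

```python
def rho_hat(between_ms, within_ms, t):
    """Sketch of a first-stage estimate of rho = sigma_mu^2 / (sigma_mu^2 +
    sigma_eps^2). The individual-effect variance is recovered as
    (between MS - within MS) / t, which can be negative in finite samples;
    truncating it at zero keeps rho in [0, 1)."""
    sigma2_eps = within_ms
    sigma2_mu = max(0.0, (between_ms - within_ms) / t)
    return sigma2_mu / (sigma2_mu + sigma2_eps)

print(rho_hat(10.0, 2.0, 4))   # (10-2)/4 = 2, so rho = 2/(2+2) = 0.5
print(rho_hat(1.5, 2.0, 4))    # negative difference truncated, so rho = 0.0
```

Without the truncation, the second case would yield a negative ρ estimate, exactly the failure mode that leads many programs to reject the random-effects model spuriously.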

36 Blundell (2001) has a good discussion of much of this work up to 2000.

37 I, myself, have worked on the closely related problem of the initial conditions in dynamic models to no very good effect (Chapters 7 and 8 of Nerlove 2002a), and have found no very good solution.


To cite this article

Print reference

Marc Nerlove, “Individual Heterogeneity and State Dependence: From George Biddell Airy to James Joseph Heckman”, Œconomia, 4-3 | 2014, 281-320.

Electronic reference

Marc Nerlove, “Individual Heterogeneity and State Dependence: From George Biddell Airy to James Joseph Heckman”, Œconomia [Online], 4-3 | 2014, online since 01 September 2014, accessed 14 June 2024. URL : ; DOI :



Marc Nerlove

Department of Agricultural and Resource Economics, University of Maryland.


Copyright


The text alone may be used under the CC BY-NC-ND 4.0 license. All other elements (illustrations, imported files) are “All rights reserved” unless otherwise stated.
