The published version is available from the Bulletin of Mathematical Biology at http://link.springer.com/article/10.1007/s11538-013-9841-6

Published on: **Mar 4, 2016**

- 1. population genetics of gene functionIgnacio GalloAbstractThis paper shows that diﬀerentiating the lifetimes of two phenotypes independently fromtheir fertility can lead to a qualitative change in the equilibrium of a population: sincesurvival and reproduction are distinct functional aspects of an organism, this observationcontributes to extend the population-genetical characterisation of biological function. Tosupport this statement a mathematical relation is derived to link the lifetime ratio T1/T2,which parametrizes the diﬀerent survival ability of two phenotypes, with population vari-ables that quantify the amount of neutral variation underlying a population’s phenotypicdistribution.During the last decade experimental research has begun using population-genetical princi-ples to link the function of genes to their population distribution [12, 15, 14]. This closelyparallels the way in which statistical thermodynamics links the macroscopic state of aphysical body to the activity of the molecules that form it, and is likely to bear similarlysigniﬁcant consequences.Population genetics has studied the mathematical relation of evolutionary and demo-graphic forces to the population distribution of alleles for more than a hundred years,and bioinformatics has long used conserved sites in comparative statistical data to inferfunction [12]. More recent works can therefore be seen as a joint maturation of theselong-standing research threads, which enables to pose questions about the relation ofgene function to gene distribution in a more detailed manner than previously possible:the analogous maturation which gave birth to statistical thermodynamics consequentlyallowed to infer the size of a molecule from observable dynamics.One important point that has been stressed in recent works is that demographic forcescan mimic the eﬀect of selection during evolution, so that it is necessary to accountemail: ignacio@cantab.net1
- 2. for demography when inferring selection from population statistics. In [15] the authorsconsider the eﬀect of a changing population size, and show that a population’s pattern ofvariation may be used to infer this type of demographic information, as well as to locatefunctionally important genetic sequences.The present paper considers a further possibility granted by this approach, which liesin using the diﬀerent eﬀects of demography and reproductive selection on gene distributionin order to infer more detailed information about gene function itself.It needs to be stressed that, for the sake of exposition, this paper uses the term “genefunction” as relating to a univocally deﬁned concept, while yet acknowledging that thisconcept clearly doesn’t exist as such. One good way to justify such simplistic use lies inthe analogy with the term “the meaning of a word”, which similarly can sometimes beput to good use, while suﬀering from an analogous ambiguity of deﬁnition.Life-expectancy, a demographic parameter, is related to function: it reﬂects, statis-tically, the ability of an organism to perform the tasks which are required in order tosurvive for certain amount of time. If life-expectancy is systematically diﬀerent for twophenotypes in a given environment, it is natural to conclude that this diﬀerence is due tothe diﬀerent way in which the two phenotypes function in such environment.Therefore, if we consider survival and reproduction to be two macro-functions per-formed by any living organism, detecting a diﬀerence in phenotypic life-expectancy -separately from a diﬀerence in fertility- must allow to get a two-dimensional descriptionof the gene function which underlies the phenotypic change. This suggests that includ-ing the eﬀect of demographic forces may not only be used to infer information abouta species’ demographic history, but also to get more detailed information regarding thespeciﬁc function that a gene is playing in a given environment, using only informationwhich characterises the population as a whole.This paper shows that life-expectancy can aﬀect the nature of a population’s phe-notypic distribution in a way which is qualitatively distinct from the eﬀect of fertility.It is also shown that if a population contains two phenotypes characterised by diﬀerent2
- 3. life-expectancies T1 and T2, then the ratioλ =T1T2can be quantitatively estimated through the phenotypes’ respective amounts of neutralvariation, independently of all other parameters.Since a population’s dynamics is typically dominated by diﬀerences in fertility, stan-dard models tend to combine the eﬀect a genetic change has on an organism’s survivalability with its eﬀect on fertility [4, 11]. In this paper we study the eﬀect of a diﬀerence insurvival explicitly: we ﬁnd that this eﬀect is indeed small, but that it leads to qualitativechanges in a population’s equilibrium regime, which are likely to generate statisticallyobservable signals at a population as well as at a comparative level.The paper is organised as follows: in the next section we describe the empirical justiﬁ-cation for the structure of the chosen model, and we then present the model in section 2.Our modelling framework consists of two levels: one level studies the equilibrium reachedby two available phenotypes, and is analysed in section 3. It is at this level that we observea qualitatively novel equilibrium regime that results from diﬀerentiating the phenotypelifetimes separately from their reproductive rates.Section 4 focuses on the second level of description, which corresponds to the amountof neutral variation characterising each of the two considered phenotypes. We are inter-ested in this variation because it allows to express the parameter λ = T1/T2 in terms ofstatistically observable data.We conclude the paper by considering the practical limitations of the derived resultsand brieﬂy outlining possible further developments.1 Empirical backgroundWe want to construct a stochastic model for the dynamics of a population consisting oftwo phenotypes which diﬀer both in their average amount of oﬀspring (which we call W1and W2, respectively), and in their average lifetimes (T1 and T2). To this end, in section 2we modify the haploid Moran model by a qualitative change in its process, which we3
- 4. Figure 1: Structure of the considered population: organisms carry one of two phenotypes,P1 and P2, each of which is coded by many synonymous genotypes that are assumed to beneutral with respect to each other.attempt to justify on intuitive grounds, and which we parametrize in terms of a novelparameter, λ = T1/T2.Since the model includes one more parameter than the standard setting, it is desir-able to correspondingly expand the number of independent quantities that we expect toobserve, so to allow the discrimination among possible causes of speciﬁc states of thepopulation. We address this need by considering the amount of synonymous variationincluded in each of our two phenotypes, which, following [15], we consider to be neutral:we therefore have a population consisting of two phenotypes coded by a larger number ofgenotypes, which are neutral with respect to each other as long as they give rise to thesame phenotype (Figure 1).This population structure parallels the structure of variation encountered by MartinKreitman when studying the alcohol dehydrogenase locus in D. melanogaster, where hefound that only one of the 43 polymorphic sites observed in his sample led to a change inthe protein coded by the gene, thus revealing that the gene consists of only two molecular4
- 5. phenotypes coded by a larger number of genotypes [9].Though the model we consider is clearly much too simplistic to apply to a naturalpopulation of Drosophila, Kreitman’s observation gives empirical justiﬁcation for buildinga model which allows only one of the sites of a long genetic sequence to lead to a changein phenotype, which would a priori seem to be a very strong assumption.We will see that by using such simpliﬁcation -which is further supported by moreextensive studies [1]- our model allows both to derive a clear characterisation of thephenotypic equilibria (section 3), and to estimate the model’s novel parameter λ (section4) in terms of statistically observable quantities, by using variations of standard results.2 Modelling approach and its relation to the stan-dard settingThe structure of our model reﬂects the empirical situation described in the last section:we consider a population of N organisms carrying genotypes of length L, where eachgenotype-site can take two possible states: however, only one of these sites corresponds toa change of phenotype, whereas mutations in other sites are neutral. It is worth pointingout that this is the same structure of the Drosophila haplotype data included in theappendix of [1], where only mutations in one speciﬁc molecular-marker site lead to anamino acid change.The model therefore includes two levels of description: a phenotypic one, which de-scribes the dynamical change in the number of individuals carrying the two availablephenotypes, and a genotypic level that characterises the amount of neutral variationavailable for each of the two phenotypes.We use the symbols P1 and P2 to denote our two phenotypes, and we keep the pop-ulation size ﬁxed at a value N: we can therefore specify the phenotypic state of the5
- 6. population by a random variable X which givesX : number of individuals with phenotype P1,N − X : number of individuals with phenotype P2.We are interested in studying the stochastic equilibrium reached by a process of death,reproduction and reversible mutation, where mutation happens with the same probabilityu in both directions, and for all the available sites, including the phenotypically-linkedone.The assumption of a mutation rate which is both symmetric and site-independent isvery idealised, and even more importantly, an equilibrium due to reversible mutation isnot generally considered to be relevant for generic mutations [7].Due to its simplicity, however, the chosen setting allows to show with remarkableclarity that an explicit diﬀerence in the phenotype lifetimes leads to a qualitative noveltype of equilibrium state: this is the aim of this paper, since it is in this novelty that wesee potential to extend the standard population-genetical characterisation of biologicalfunction; having a full characterisation of this elementary case should be useful whenconsidering more realistic and analytically challenging situations.In the next subsection we therefore describe a variation of the Moran process which,as we will attempt to justify on intuitive grounds, provides a good model for phenotypesthat are allowed to diﬀer independently in lifetime and average number of oﬀspring. Itis interesting to point out that Moran introduced in Ref. [11] a variation of his originalmodel that implements reproductive selection by diﬀerentiating his alleles’ lifetimes whilekeeping the instantaneous reproductive rates the same for all his phenotypes (thus dif-ferentiating their life-long reproductive yields): Moran remarks that considering lifetimeand reproductive diﬀerences separately would almost certainly make no diﬀerence for theequilibrium distribution. Our implementation of the phenotypic diﬀerence comes as anatural extension of Moran’s, and our claim is that, though the subtlety of this exten-sion’s eﬀect roughly conﬁrms his intuition, the qualitative nature of this change providesconsiderable descriptive potential.6
- 7. As stressed before, when introducing a quantity λ to parametrize the diﬀerentiationin the lifetimes, it becomes desirable to extend the set of quantities that we expect toobserve in the statistics of the population: in other words, though the quantityx =XNfully characterises our population’s phenotypic state, we can gather further information bylooking at how the X individuals carrying phenotype P1 are partitioned into synonymousgenotypes, and similarly for the N − X individuals carrying phenotype P2.A natural intuitive choice to characterise neutral variation in the two phenotypes wouldbe to count the actual number of synonymous genotypes present for each: however, asremarked in [4], an interesting alternative is to use the inbreeding coeﬃcient concept.The inbreeding coeﬃcient is typically used for diploid organisms, since it is deﬁnedas the probability that a given genetic locus is homozygous: that is, it is the probabilitythat for a diploid organism chosen at random from a population, the two alleles that theorganism contains at such locus are found to be identical. However, under the assumptionof random mating, this turns out to be equivalent to the probability that any two allelesdrawn at random from the population are identical, regardless of the separation intoorganisms.The latter deﬁnition makes the quantity relevant to haploid populations, and it turnsout to be a more analytically accessible one than the aforementioned actual number ofsynonymous genotypes, at least for the estimation purpose considered in section 4, inwhich we generalise a result taken from [8]. The inbreeding coeﬃcient also provides aneﬀective approximation to the actual number of synonymous alleles, and it has beensuggested to be more empirically accessible than the latter quantity [4].2.1 Phenotypic levelHere we describe the process through which we model the change in our population’sphenotypes, and make our attempt to justify the modelling choice through intuition: theobservable outcome of this process is analysed in section 3.7
- 8. Studies using population genetics to infer function are typically based on the Wrightmodel [12, 15], which describes the stochastic change of a population at discrete non-overlapping generations, and it is customary to describe the allele dynamics by using acontinuous approximation to this process. There is, however, a somewhat paradoxicalaspect to this standard modelling approach: whereas the Wright model describes thepopulation as changing in discrete generations which might last for a considerable amountof time, the continuous approximation requires such generations to be taken of vanishingduration.This assumption is well justiﬁed by the fact that processes are often considered tobe taking place on an evolutionary time scale, which is much longer than the generationtime, as well as by the fact that if one assumes organisms not to be subject to ageing,which is virtually unavoidable at the simplest level of description, the Wright model isformally equivalent to a process of death and birth [3].On the other hand, the point of view of this paper is that, though suitable giventhe typical assumptions, the reliance on a discrete generation setting might hinder theconsideration of relevant modiﬁcations: here we propose a modiﬁcation to the Moranmodel that stems from features of an instantaneous event, which are prohibitively diﬃcultto visualise at a generation level.The Moran model describes the process of change in a population consisting of twotypes of organism. A time interval in this model is deﬁned by the occurrence of a deathevent, followed by a birth event. It is sensible, in principle, to deﬁne time through theseevents, since these change the state of the population, which is the object under study.There is a reason, however, to regard time as ﬂowing according to an external frameof reference, thus including time intervals during which no population events happen:death events may be thought to take place at a rate determined by the physiology of theorganisms; due to competition, however, birth events may be thought to happen instan-taneously when the death of an individual makes a safe spot available in the environment.In fact, both the Wright and the Moran model set the population size to be ﬁxed ata value N, which represents the environment’s carrying capacity: the meaning of this8
- 9. (a)(b)Figure 2: Here we illustrate the diﬀerence between the (a) Moran process and (b) theprocess considered in this paper. At every time interval in (a) an organism is chosen todie, and a new one is chosen to replace it: the types of the dead and newborn organismsdetermine change in phenotype P1’s frequency X as −1, 0 or 1. The loop in (b) showsthat in the present model time ﬂows according to an external reference: this allows tocharacterise the diﬀerent nature of birth and death events.9
- 10. constraint is that a death event should be interpreted as one which vacates an environ-mental safe spot, and we argue that the presence of competition makes it intuitivelyadmissible to model the subsequent birth event as happening instantaneously as soon asthe environmental spot becomes available.As a consequence, having an external time reference allows to describe more adequatelythe interplay between the two diﬀerent types of competition which characterise 1) anorganism’s struggle to survive, and thus to preserve its environmental spot, and 2) thereproductive struggle to occupy all spots as soon as they become available.According to this interpretation, all organisms in a population might be thought to beplaying a waiting game similar to the children’s game “musical chairs” where N −1 chairsare available for N children to sit on when the music stops. This process has already beenconsidered in an evolutionary setting in [2]: in our case, rather than focusing on modellingthe competitive game which determines the allocation of an available spot, we considerthis allocation to happen trivially and instantaneously, and we focus on the waiting gameitself.In practice, our modiﬁcation of the Moran model consists of a process which allowsat most one death-birth event per time interval rather than exactly one as in the originalmodel [10]: the two diagrams in Figure 2 illustrate this diﬀerence.Figure 2(a) shows the change in the phenotype frequency X during a time interval inthe original Moran model: an individual is chosen at random from the population andkilled, and then replaced by a new individual. The change in X is then determined bythe phenotypic identity of the dead and of the newborn.Diﬀerently from Figure 2(a), Figure 2(b) (which describes our process) contains aloop at the origin of the diagram. This formalises the diﬀerent nature of the death andbirth events: at a given time instant no organisms might die; when a death does happen,however, a birth systematically follows instantaneously.This modelling choice is arbitrary: contrarily to our assumption, following a deathcompetitive conﬂicts between organisms might lead to a substantial delay in the allocationof the newly-vacated spot, and this could considerably change the nature of the process.10
- 11. This objection, however, only highlights the descriptive potential of a modelling approachthat uses intuition to consider fundamental population events in some detail, an approachfor which a considerable gap exists in the mathematical biology literature: here we lookat the consequences of a simple such possibility.The model is therefore deﬁned by the transition probabilities p−, q−, p+and q+inFigure 3, which are in turn derived from the life-cycles of the two organisms.Transition probability p−corresponds to the event that an organism with phenotypeP1 dies, whereas q−corresponds to the same event for phenotype P2.We denote the relative frequency of phenotype P1 by x = X/N, and its average lifetimeby T1: under the assumption that each organisms is reproductively mature at birth andis not subject to ageing, we have thatp−=xT1, and q−=1 − xT2. (2.1)In this paper we refrain from giving a fully detailed derivation of these formulas: theissues involved in the rigorous foundation of this level of modelling are problematic, andthis is indeed related to the fact that variations such as (2.1) are not often encounteredin the literature. This paper rather tackles foundational issues by proposing (2.1) as aspeciﬁc variation of the standard approach.The technical aspects of the derivation of (2.1) are not, however, fundamentally dif-ferent from those encountered in the Wright and the Moran models, and the connectioncan be intuitively clariﬁed by the following observation. The quantity p−is the productof 1) the Moran-like probability that an organism carrying phenotype P1 is chosen to die(x = X/N), and 2) the probability that it actually dies. The latter probability, whichwe can call δ1, corresponds to the fact that in our model organisms are always given achance to survive: this can be given a more fundamental justiﬁcation if one considers amodel in which an arbitrary number of organisms can die at any given time interval. Weshall, however, refrain from pursuing this line of reasoning further, and leave it for a morespeciﬁc future work.We want our model’s parameters to correspond to biological features: assuming thatour organisms are not subject to ageing, or to environmental ﬂuctuations, we have that11
- 12. Figure 3: Transition probabilities for the fundamental population events in our process:p−and p+correspond to death and birth (after mutation) of organisms with phenotypeP1. Similarly we have q−and q+for P2, and the existence of the loop at the origin of thediagram is due to that in general p−+ q−< 1.their average lifespan is equal to the mean of a geometric distribution with parameter δ1(for phenotype P1), which leads toT1 =1δ1.This gives p−in (2.1), and the same reasoning applies to q−.In view of (2.1) we have that in generalp−+ q−< 1,and this is the cause of the qualitative eﬀect arising from diﬀerentiating the phenotypiclifespans, which gives the model’s novelty.The second biological feature which we assign to our phenotypes is the average numberof oﬀspring produced by an organism during its entire lifetime, which we denote by W1and W2 for phenotypes P1 and P2, respectively.Transition probabilities p+and q+are obtained by considering elementary events in asimilar way as for p−and q−, taking into account that reproduction involves also mutation,which we model as happening with probability u in both directions.12
- 13. Under these life-cycle conditions it can be shown thatp+=W1T1(1 − u) x +W2T2u (1 − x)W1T1x +W2T2(1 − x),andq+=W1T1u x +W2T2(1 − u)(1 − x)W1T1x +W2T2(1 − x).The reason for the denominators in p+and q+is that, as we stressed before, a re-production event is assumed to happen instantaneously when a death event vacates anenvironmental spot, so that a “death followed by no birth” is not considered to be apossible event. This determines the “musical-chairs” nature of the model, which we claimto be a particularly insightful way of modelling a process of competition [2], and whichquantitatively corresponds top++ q+= 1.We are particularly interested in including parameter values for which the equilibriumdistribution does not become trivial in the limit of large population size, and to this endwe employ the following asymptotic scalings for the mutation parameterN u −→N→∞θ,and for reproductive selection parameterNW1W2− 1 −→N→∞s,which we use as deﬁnitions for the rescaled parameters θ and s. For convenience we alsouse the parameter λ for the ratio between lifetimes:λ =T1T2.In section 3 we will need the ﬁrst two moments of the change in the variable X at agiven time to write down the large population size limit for the equilibrium distributionattained by the phenotypes. To this end, after deﬁning∆X(x) = Xt+1 − Xt ,13
- 14. we need to compute quantities M(x) = E ∆X(x) and V (x) = E (∆X(x)2in termsof the transition probabilities deﬁned above. The functional dependence on x shows thatthis moments are computed conditionally on the relative frequency of phenotype P1 beingequal to x = X/N: for convenience, however, from now on we drop the x dependencefrom the notation.Using inspection on Figure 3 we ﬁnd thatM = limN→∞(q−p+− p−q+),and thatV = limN→∞(q−p++ p−q+),which in terms of the asymptotic parameters givesM =1NT1·θ λ2(1 − x)2+ λ s x(1 − x) − θ x2x + λ(1 − x),andV =2λNT1·x (1 − x)x + λ(1 − x).We see that the novelty of the model is nicely shown algebraically by the presence ofa factor (x + λ(1 − x)) in the denominator of both M and V , its presence in the latterbeing particularly signiﬁcant for the form of the equilibrium distribution: we discuss theanalytic consequences of this in section 3.2.2 Neutral variationUnderlying the process of change in the phenotypic frequencies we have the process ofcreation of new neutral mutations to phenotypes P1 and P2, and of their stochastic loss.As mentioned in the introduction to this section, we model each genotype as a se-quence of L two-state sites, which includes a site (the “phenotypically-linked” site) whosemutation causes the change between phenotypes P1 and P2, and for the sake of simplicitywe make the rather strong assumption that mutation happens with same probability uat all sites, and in both directions: therefore, the probability of mutation u relevant to14
- 15. the phenotypic equilibrium also parametrizes the amount of neutral variation for the twophenotypes.Like we said before, rather than using the actual number of neutral genotypes intowhich each phenotype is partitioned, we choose to characterise neutral variation by theinbreeding coeﬃcient. For a population of haploids, such as the one we consider, theinbreeding coeﬃcient can be deﬁned as the probability that two genotypes drawn atrandom from the population are identical.Therefore, in addition to random variable x that characterises the population’s phe-notypic distribution, we deﬁne the two quantitiesF1 = probability that two organisms with phenotype P1 have the same genotype,F2 = probability that two organisms with phenotype P2 have the same genotype.In section 4 we ﬁnd an explicit formula for the new parameter λ in terms of com-bined moments of quantities F1, F2 and x: in this paper’s point of view such a relationcontributes to extend the population-genetical characterisation of biological function.Kimura and Crow [8] ﬁnd that in an “inﬁnite alleles” model, which compared to themodel presented here may be thought of as one where only one phenotype exists, andwhere genotypes have an inﬁnite number of sites, the inbreeding coeﬃcient is on averageequal toF ≈12 N u + 1.In section 4 we generalise their calculation to include our case, and see how the result canbe used to express the parameter λ = T1/T2 in terms of statistically observable quantities.3 Phenotypic equilibriumHere we describe the equilibrium distribution attained by the population’s phenotypesP1 and P2 under our process of selection and reversible mutation. It is worth stressingimmediately that the qualitative novelty of including the diﬀerentiation of phenotypelifetimes manifests itself analytically in the probability density function of P1’s relative15
- 16. frequency x,φ(x) = Ceα xxλ θ−1(1 − x)θ/λ−1x + λ(1 − x) ,through the factor (x+λ(1−x)), which is not usually seen in population genetics models.When λ = 1 this factor is equal to one, and we recover the typical equilibrium distributionfor a haploid population under reversible mutation and selection.3.1 Form of the equilibrium distributionThe form of the phenotypic equilibrium distribution can be obtained in a large populationsize approximation by using Wright’s formulaφ(x) =CV (x)exp 2M(x)V (x)dx , (3.2)where C is a normalisation constant.This formula was derived by Wright [16] for a process divided into non-overlappinggenerations, and it has been proved by Moran to be applicable to various versions ofhis model [11]. More importantly for our case, Cannings [3] has shown by a conciseobservation that an overlapping generations model can be considered formally equivalentto a non-overlapping one as long as organisms are not subject to ageing, and this allowsus to use approximation (3.2) in its full generality.According to the last section our model givesM =1NT1·θ λ2(1 − x)2+ λ s x(1 − x) − θx2x + λ(1 − x),andV =2λNT1·x(1 − x)x + λ(1 − x),so our model’s equilibrium φ for P1’s relative frequency x = X/N takes the following formφ(x) = Ceα xxλ θ−1(1 − x)θ/λ−1x + λ(1 − x) ,whereα = s + θ1λ− λ ,16
- 17. x (relative frequency of P1)probabilitydensity(a)probabilitydensityx (relative frequency of P1)(b)probabilitydensityx (relative frequency of P1)(c)Figure 4: These are the three basic types of equilibrium distribution which can be attainedat a phenotypic level, and which correspond to diﬀerent intensities of mutation in thefollowing way: (a) N u 1, (b) N u 1, (c) N u ≈ 1. Regime (c) is caused by thediﬀerentiation of the phenotype lifetimes, and appears when λ = 1.17
- 18. and C is the normalization constant which ensures that10φ(x)dx = 1.We see that for λ = 1 (that is, when T1 = T2) we recover the equilibrium distributionfor a typical haploid random drift model with reversible mutation. When λ = 1 wehave that the factor (x + λ(1 − x)) produces a qualitative diﬀerence in the shape of thedistribution, though this only happens for values of θ = N u close to one: Figure 4 showsthe shape of the equilibrium distribution for diﬀerent values of the mutation parameter.We ﬁnd that the there is an intermediate regime (Figure 4(c)) between the typicallow-mutation U-shape (Figure 4(a)) and the high-mutation bell-shape (Figure 4(b)). Theintermediate regime admits two stationary points, and as a consequence becomes bimodal.In general, we have that for any parameter value the type of equilibrium can be charac-terised in terms of the number of stationary points which the distribution exhibits: wecarry out such characterisation in the next subsection.3.2 Diagram characterising the equilibrium population’s modesIn the last subsection we saw how a diﬀerence in the lifetimes of our two phenotypes canlead to a new type of equilibrium regime for the population, which consists of a hybridbetween the classical low and high mutation regimes: this new regime exhibits both a localprobability maximum (as in the high-mutation case) and a maximum at the boundary(as in the low-mutation regime).Looking at the the shapes of the equilibrium distributions in Figure 4, we see thatthese shapes can be well classiﬁed in terms of the number and nature of their stationarypoints, which in turn determine the number and location of the distribution maxima, ormodes.Figure 4(a) has only one stationary point, which is a minimum, and this implies theexistence of two modes at x = 0 and x = 1 (the probability density in fact diverges toinﬁnity at these boundary points, though this singularity does not aﬀect the possibility ofnormalising the distribution): this characterises the regime of low mutation as one where18
- 19. the population polarises about one of the phenotypes, and rarely switches to the other.Figure 4(b) also has only one stationary point, which is a local maximum: this suggeststhat the dynamics of the population in this regime will typically be one where a mixedphenotypic state ﬂuctuates around this maximum.Figure 4(c) shows a novelty of the present model: we have two stationary points,a local maximum and a local minimum, which implies the existence of a second modealso at one of the boundaries. This suggests that the dynamics will tend to polarise thepopulation around one speciﬁc phenotype: the existence of the local maximum, however,suggests that this state of polarisation should be periodically lost in favour of a mixedconﬁguration for the two phenotypes, and that this switch should happen considerablymore often than the switch between the polarised states of (a). This, however, can onlybe elucidated by considering the model’s dynamics explicitly, and could be the topic of afuture work.In order to understand how parameter values relate to cases (a), (b) and (c), we use theprobability distribution function we obtained in the last section to locate the stationarypoints of the distribution φ.The conditiond φ(x)d x= 0,which gives the stationary points, leads to the following rational equation for x:α +ax−b1 − xx + λ (1 − x) + 1 − λ = 0, (3.3)wherea = λ θ − 1, b =θλ− 1, and α = s + θ1λ− λ .Multiplying equation (3.3) by factors x and (1 − x), we obtain a polynomial equationof degree 3, which admits three solutions. We are, however, only interested in solutionslying on the real interval going from 0 to 1, since these correspond to meaningful valuesfor the relative frequency x.Rather than solving the cubic in x, we can use (3.3) to ﬁnd a functional expressionfor θ = N u (for which equation (3.3) is linear). By considering θ as a function of x, and19
- 20. (a) (b)Figure 5: Stationary points for the equilibrium distribution (a) in the classical case whereλ = T1/T2 = 1, and (b) for λ = T1/T2 = 3/2. Each line corresponds to a diﬀerent valueof the selection coeﬃcient “ s”: dashed lines are for local minima, solid lines for localmaxima. When λ = 1 bimodal equilibrium states exist: the dotted line corresponds to thevalue of the mutation probability “ u” which gives rise to the distribution in Figure 4(c),for the value of the selection coeﬃcient “ s” corresponding to the line highlighted in red.looking at this function for diﬀerent values of parameters s (reproductive selection) andλ (the lifetime ratio), we get a full characterisation of the distribution’s stationary pointsand, as a consequence, of its modes: we do this in Figure 5.Figure 5 shows the stationary points for the equilibrium distribution: the two picturescorrespond to the two diﬀerent cases λ = 1 and λ = 1. Diﬀerent lines correspond todiﬀerent values of the selection coeﬃcient s, for which they give the dependence of theposition of the stationary points on the mutation probability u. Dashed lines correspondto local minima and solid lines to local maxima.Figure 5(a) shows the classical situation where T1 = T2 (λ = 1). As expected, we ﬁnd20
- 21. an abrupt transition in the nature of the stationary points when the mutation probabilityu = 1/N, at which value the equilibrium distribution turns from being U-shaped to beingbell-shaped: however, the diagram shows that all parameter values except u = 1/N leadto only one stationary point in the equilibrium distribution. Highlighted in red we see thefunctional dependence of the unique stationary point on the mutation probability, for aparticular value of the reproductive selection coeﬃcient s.Figure 5(b) shows the same diagram for λ = T1/T2 = 3/2: we see highlighted in red thedependence of the equilibrium distribution’s stationary points on the mutation probability,for the same value of reproductive selection s used in for the red curve in Figure 5(a). Thediagram shows how near u = 1/N there are regions where two stationary points coexistfor the same distribution, a situation which is not encountered in classical populationgenetics models for haploid populations. The dotted line, which shows an example ofthis, corresponds to the regime in Figure 4(c).An important fact which we learn from Figure 5 is that the diagram for λ = 1 isnot robust with respect to changes in the parameter values, whereas it is robust for anyother value of λ. This means that for any λ = 1 the diagram will exhibit regimes wherethe equilibrium distribution admits more than one stationary point, and the transitionbetween the diﬀerent qualitative types of equilibrium will come about through the sametype of bifurcations which we see in Figure 5(b).4 Estimation of λ =T1T2In this section we derive the parameter λ = T1/T2 from population data, and in partic-ular from statistical quantities characterising the amount of neutral variation underlyingphenotypes P1 and P2. As mentioned above, in order to quantify the amount of neu-tral mutation we use the inbreeding coeﬃcient concept, by deﬁning the following twoquantitiesF1 = probability that two organisms with phenotype P1 have the same genotype,F2 = probability that two organisms with phenotype P2 have the same genotype.21
- 22. An important reason for using this approach is that it allows to extend an intuitiveresult obtained by Kimura and Crow in [8], where the equilibrium value for the inbreedingcoeﬃcient was computed for a population consisting of only one phenotype, rather thanof two phenotypes like in our case.The basic observation that allows Kimura and Crow’s calculation is that, if we denotethe inbreeding coeﬃcient for their unique phenotype by F, and if we assume the populationchanges according to Wright’s process, in the absence of mutation the value of F changesaccording to the following equation:F(t + 1) =1N+ 1 −1NF(t). (4.4)The intuitive reason for this is that at generation t + 1 any two individuals have a prob-ability 1/N of being born from the same parent: if they do, in the absence of mutationthey share the same genotype with probability 1, which gives the ﬁrst term on the lefthand side of (4.4).In the presence of mutation, assuming that each mutation generates a mutation whichpreviously didn’t exist,F(t + 1) =1N+ 1 −1NF(t) (1 − u)2,which leads to the the equilibrium average valueF =1 − 2 u2 N u − 2 u + 1≈12 N u + 1. (4.5)The “inﬁnite alleles” assumption, according to which each new mutation produces agenotype not contained in the population, is equivalent to assuming that the length Lof our genotype is very large, and we shall be making the same assumption in order togeneralise (4.5).The line of reasoning used to obtain (4.5) can be extended to our situation, where ahaploid population subdivided into two phenotypes P1 and P2 changes by single death-and-birth events, rather than at discrete generations: since this setting is less symmetricthe details are more articulated, though the idea behind the calculation remains the same.22
- 23. In our model the change in F1 at time t depends 1) on whether a death has takenplace, 2) on the phenotype of the organism that died, and 3) on the phenotype of thenewborn (taking into account the possibility that a mutation might have taken place).To reduce the complexity of the calculation we make the assumption that a newborn,before mutation, shares the same phenotype of the organism which has died. This wouldonly strictly hold if the relative sizes for phenotypes P1 and P2 stayed a ﬁxed throughoutthe process: in Figure 6 we show, however, the result of a simulation that suggests thatour assumption leads to a formula which gives a rather precise approximation, and thisis suﬃcient to support the paper’s claim that it is theoretically feasible to extend thepopulation-genetical characterisation of biological function.To compute the change in F1 we therefore need to consider three cases for the type ofevent taking place at time t:A: an organism of type P1 dies, is replaced by a newborn of the same type, and thenewborn might mutate, either at the phenotypically-linked site, or at one of theneutral sites,B: an organism of type P2 dies, is replaced by a newborn of the same type, and thenewborn might mutate, either at the phenotypically-linked site, or at one of theneutral sites,C: no death takes place, so no replacement happens.According to our process, and taking into account our simplifying assumption -thatsets the phenotype of a dead organism equal to that of the subsequent newborn- theprobabilities of events A, B and C are:P(A) =xT1, P(B) =1 − xT2, P(C) = 1 −xT1−1 − xT2.Keeping in mind that the quantity F1 is deﬁned as the probability that two organismschosen at random from the population and which have phenotype P1 also share the samegenotype, we now need to ﬁnd how F1 changes in each of the three cases A, B and C.————————23
- 24. Preliminary calculation of sampling probabilities:The probability F1(t + 1) is associated with a couple of organisms drawn at randomfrom the population, and therefore it depends on whether one of the two sampled organ-isms happens to be the one which was born during the last step. In particular, we needto distinguish the following three sampling events:S1 : the newborn organism is chosen in the sampling, but its parent is not,S2 : the sampled couple consists of the newborn and its parent,S3 : the newborn is not sampled.Assuming that the organisms are sampled from the population without reinsertion, theprobabilities of events S1, S2 and S3 are as follows:P(S1) =1X+1X − 1−3X(X − 1),P(S2) =2X(X − 1),P(S3) = 1 −1X−1X − 1+1X(X − 1),where, like before, X is the total number of individuals with phenotype P1.Using these probabilities we can now ﬁnd F1(t + 1) in each of the three cases A, Band CCase A:To see how F1 changes after an event of type A we need to know two things:- whether the newborn mutated (and whether the mutation happened at the sitelinked to the change in phenotype),- whether one of the two organisms which are selected at random to compute theprobability F1 is the newborn (and whether the other happens to be its parentorganism).If a mutation happens at the phenotypically-linked site, any two organisms of type P1sampled at time t + 1 will have the same probability of sharing the same genotype, as24
- 25. they did at time t: since the probability of mutation at any site is u, this will contributeu F1(t)to the average value of F1 at time t + 1, conditional to event A.If a mutation happens at a neutral site, the probability of the newborn having thesame genotype as any of the other organisms is zero (this is a consequence of assumingthat L is large enough for each neutral mutation to produce a totally new genotype).Since the probability of a neutral mutation is (L − 1) u, we have that in this case thecontribution is0 · (L − 1) u P(S1) + P(S2) + F1(t) · (L − 1) u P(S3) = F1(t)(L − 1) u P(S3),The ﬁrst term on the left-hand-side corresponds to the event that the newborn is chosenat the sampling (events S1 and S2 above), whereas the second term, which gives the non-zero contribution, corresponds to the fact that the probability F1 remains unchanged aslong as none of the chosen organisms is the newborn (event S3).Finally, for the case in which no mutation happens at any site, which has probability(1 − L u), we get the following contribution:(1 − L u) F1(t) · P(S1) + P(S3) + 1 · P(S2) ,where the second term corresponds to the fact that, as long as no mutation takes place,the probability that the newborn shares its genotype with its parent is equal to 1.Therefore, if we denote the value of F1(t + 1) conditional to event A by F1(t + 1|A),summing all three contributions we getF1(t + 1|A) = u F1(t) + F1(t)(L − 1) u P(S3) + (1 − L u) F1(t) · P(S1) + P(S3) + P(S2) .Case B:In this case we have a newborn of type P2. Like for case A, we use the term F1(t + 1|B)to the denote the new value of F1 conditional to B, and we have thatF1(t + 1|B) = 1 − u P(S1) + P(S2) F1(t).25
- 26. It’s straightforward to see why: if the newborn is of type P2, the inbreeding coeﬃcientremains unchanged unless the newborn mutates in the phenotypically-linked site and issubsequently chosen in the sampling: the probability of this event is u · P(S1) + P(S2) .Case C:In this case nothing happens, so we have thatF1(t + 1|C) = F1(t).————————We can now write the value of F1(t + 1) as the sum of the conditional contributionsmultiplied by their respective probabilities:F1(t + 1) = F1(t + 1|A) P(A) + F1(t + 1|B) P(B) + F1(t + 1|C) P(C),and we can use symmetry to obtain an analogous relation for F2.In the limit of large N these two relations simplify substantially, so that up to order1/N2we getF1(t + 1) − F1(t) =2N2 T11x−1xF1(t) 1 + θ λ(1 − x) + x (L − 1) ,F2(t + 1) − F2(t) =2N2 T211 − x−11 − xF2(t) 1 + θ1λx + (1 − x)(L − 1) ,where the terms x and (1 − x) at the denominator arise from the sampling probabilitiesP(S1), P(S2), P(S3).Therefore, denoting by · the average with respect of all realisations of our process,we get the following relations linking the moments of the observable quantities to themodel parameters:F1/x + θ λ( F1/x − F1 ) + (L − 1) F1 = 1/x ,F2/(1 − x) + θ1λ( F2/(1 − x) − F2 ) + (L − 1) F2 = 1/(1 − x) .26
- 27. 0.5 1 1.5 20.75 1.25 1.750.511.52.751.251.75RealEstimatedT1= 2 T2T1= 1/2 T2Figure 6: Comparison of the actual value of λ = T1/T2 and the value estimated from Eqn.(4.6), for simulations using values ranging from λ = .5 to λ = 2. The moments neededfor Eqn. (4.6) are estimated from 10000 process realisations for each value of λ; the otherparameters are s = −5, u = 0.007, N = 1000 and L = 40.Though the notation is somewhat cumbersome, it is easy to see that these two equa-tions oﬀer a relation between the model parameters θ and λ and the averages of the sixrandom quantitiesF1, F2,1x,11 − x,F1x,F21 − x,all six of which are in principle statistically observable.Since this paper focuses on the parameter λ = T1/T2, in virtue of its putative relevancein terms of function, we proceed by solving both equations for θ, and equating them inorder to ﬁnd a relation for λ.In order to express the mentioned relation in a more compact form we deﬁne the27
- 28. following auxiliary quantitiesR =F2F1·1/x − F1/x1/(1 − x) − F2/(1 − x),Q1 =F1/xF1− 1, Q2 =F2/xF2− 1.In terms of these quantities, the equation for λ takes the following formR = λλQ1 + L − 1Q2 + λ(L − 1),and this relation leads to a quadratic equation that only admits one non-negative solution:λ =12Q1(R − 1)(L − 1) + (R − 1)2(L − 1)2 + 4RQ1Q2 . (4.6)Figure 6 shows the result of using formula (4.6) to estimate λ, for a series series ofsimulations where the real value of λ ranges from .5 (i.e. T1 = 1/2 T2) to 2 (i.e. T1 = 2 T2):we see that the average values of such estimations are well aligned with the actual values.The magnitude of the standard deviation for our estimations, on the other hand, isconsiderable, especially in view of the fact the 10000 realisations of the process were usedto estimate each value of λ: it is clear that a substantial increase of eﬃciency will beneeded to make the theory relevant to actual empirical phenomena.This practical consideration should not obfuscate, however, the fact that equation(4.6) provides a relation between population statistics given by x, F1 and F2, and param-eter λ = T1/T2, which contains functional information related to a gene’s eﬀect on anorganism’s ability to survive, rather than on its reproductive ﬁtness.5 OutlookWe have shown that diﬀerentiating the lifetimes of two phenotypes independently fromtheir fertility leads to a qualitative change in the equilibrium state of a population: sincesurvival and reproduction are quite distinct macro-functions performed by any livingorganism, this contributes to extend the population-genetical characterisation of biologicalfunction.28
- 29. We have furthermore shown that, by using information provided by neutral variation,the lifetime ratio λ can be expressed explicitly in terms of statistically observable quan-tities, and independently of all other parameters. This both gives some support to thepossible empirical relevance of the proposed modelling approach, and suggests observablequantities that can be useful in characterising the stochastic equilibrium of a populationin terms of the functional features of the individuals which comprise it.It needs to be stressed, however, that the statistical resolution needed to estimate λeﬃciently following this method seems to go beyond what could be achieved empirically: inorder to obtain Figure 6, 10000 realisations of the system were needed for each parametervalue, and for each value 5000 generations were needed for the population to relax to itsstochastic equilibrium.This study aims to be a proof of principle, and should only be considered a worstcase scenario, which nevertheless shows that inferring functional details from populationgenetical considerations is a deﬁnite theoretical possibility. It is left for a future workto assess its practical feasibility by improving the estimation eﬃciency, possibly whileconsidering dynamical statistics explicitly: it is useful to remember, however, that thedynamics of no system has ever been understood without a suﬃcient grasp of how relevantforces balance one another to allow observation.Acknowledgments An acknowledgment is due to Henrik Jeldtoft Jensen for drafting anecological model which partially inspired the one presented here. The research was sup-ported by a Marie Curie Intra European Fellowship within the 7th European CommunityFramework Programme.References[1] A. Berry, M. Kreitman, (1993), Molecular analysis of an allozyme cline: alcoholdehydrogenase in Drosophila melanogaster on the east coast of North America, Ge-netics 134, 869-93.29
- 30. [2] K. G. Binmore, L. Samuelson, V. Richard, (1995), Musical Chairs: Modeling NoisyEvolution, Games and economic behavior 11, 1-35.[3] C. Cannings, (1973), The equivalence of some overlapping and non-overlapping gen-eration models for the study of genetic drift, Journal of Applied Probability 10(2),432-436.[4] J. F. Crow, M. Kimura, (2009), An introduction to population genetics theory(reprint of 1970 edition by Harper and Row), The Blackburn Press, New Jersey.[5] D. J. Futuyma, (2009), Evolution 2nd ed, Sinauer Associates Inc, Sunderland MA.[6] J. H. Gillespie, (2004), Population Genetics: A Concise Guide 2nd ed, Johns Hop-kins University Press, Baltimore.[7] D. L. Hartl, A. G. Clark, (2007), Principles of Population Genetics 4th ed, SinauerAssociates Inc, Sunderland MA.[8] M. Kimura and J. F. Crow, (1964), The number of alleles that can be maintainedin a ﬁnite population, Genetics 49, 725-738.[9] M. Kreitman, (1983), Nucleotide polymorphism at the alcohol dehydrogenase locusof Drosophila melanogaster, Nature 304, 412-417.[10] P. A. P. Moran, (1958), Random processes in genetics, Mathematical Proceedingsof the Cambridge Philosophical Society 54, 60-71.[11] P. A. P. Moran, (1958), A general theory of the distribution of gene frequencies. I.Overlapping generations, Proc. R. Soc. Lond. B 149, 102-112.[12] R. Nielsen, (2005), Molecular signatures of natural selection, Annu. Rev. Genet. 39,197-218.[13] S. A. Sawyer, D. L. Hartl, (1992), Population genetics of polymorphism and diver-gence, Genetics 132, 1161-76.30
- 31. [14] S. A. Sawyer, L. I. Wu, M. Emerman, H. S. Malik, (2005), Positive selection of pri-mate TRIM5 alpha identiﬁes a critical species-speciﬁc retroviral restriction domain,Proc. Natl. Acad. Sci. USA 102, 2832-37.[15] S. H. Williamson, R. Hernandez, A. Fledel-Alon, L. Zhu, R. Nielsen, and C. D.Bustamante, (2005), Simultaneous inference of selection and population growth frompatterns of variation in the human genome, Proc. Natl. Acad. Sci. USA 102(22),7882-7887.[16] S. Wright, (1937), The distribution of gene frequencies in populations, Proc. Natl.Acad. Sci. USA 23: 307-320.31