The R Book: Count Data in Tables

15 Count Data in Tables

The analysis of count data with categorical explanatory variables comes under the heading of contingency tables. The general method of analysis for contingency tables involves log-linear modelling, but the simplest contingency tables are often analysed by Pearson's chi-squared, Fisher's exact test or tests of binomial proportions (see p. 365).

15.1 A two-class table of counts

You count 47 animals and find that 29 of them are males and 18 are females. Are these data sufficiently male-biased to reject the null hypothesis of an even sex ratio? With an even sex ratio the expected number of males, and likewise of females, is 47/2 = 23.5. The simplest test is Pearson's chi-squared, in which we calculate

\[ \chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}. \]

Substituting our observed and expected values, we get

\[ \chi^2 = \frac{(29 - 23.5)^2 + (18 - 23.5)^2}{23.5} = 2.574468. \]

This is less than the critical value for chi-squared with 1 degree of freedom (3.841), so we conclude that the sex ratio is not significantly different from 50:50. There is a built-in function for this:

    observed [...]

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   3.1570     0.1459   21.64 [...]

    [...]

      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1         0  1.021e-14
    2         1    0.11715 -1 -0.11715   0.73215

There is no interaction between seed colour and seed shape (p = 0.73215), so we conclude that the two traits are independent and the phenotypes are distributed 9:3:3:1 as predicted. The p value is slightly different because the ratios of the two dominant traits are not exactly 3:1 in the data: round to wrinkled is exp(1.08904) = 2.97142 and yellow to green is exp(1.15702) = 3.180441:

    summary(model2)

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  4.60027    0.09013   51.04 [...]

    [...]

      Resid. Df Resid. Dev Df    Deviance Pr(>Chi)
    1         0 0.00000000
    2         1 0.00079137 -1 -0.00079137   0.9776

This shows very clearly that the interaction between caterpillar attack and leaf holing does not differ from tree to tree (p = 0.97756). Note that if this interaction had been significant, then we would have stopped the modelling at this stage. But it was not, so we leave it out and continue. What about the main question?
Is there an interaction between aphid attack and leaf holing? To test this we delete the Caterpillar by Aphid interaction from the model, and assess the results using anova:

    model3 [...]

      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1         2  0.0040853
    2         1  0.0007914  1 0.003294   0.9542

There is absolutely no hint of an interaction (p = 0.954). The interpretation is clear: this work provides no evidence at all for induced defences caused by early season caterpillar feeding.

But look what happens when we do the modelling the wrong way. Suppose we went straight for the interaction of interest, Aphid by Caterpillar. We might proceed like this:

    wrong [...]

      Resid. Df Resid. Dev Df Deviance Pr(>Chi)
    1         4     550.19
    2         5     556.85 -1  -6.6594 0.009864 *

The Aphid by Caterpillar interaction is highly significant (p = 0.01), providing strong evidence for induced defences. This is wrong! By failing to include Tree in the model we have omitted an important explanatory variable. As it turns out, and as we should really have determined by more thorough preliminary analysis, the trees differ enormously in their average levels of leaf holing:

    as.vector(tapply(Count,list(Caterpillar,Tree),sum))[1]/tapply(Count,Tree,sum)[1]

         Tree1
    0.01963439

    as.vector(tapply(Count,list(Caterpillar,Tree),sum))[3]/tapply(Count,Tree,sum)[2]

         Tree2
    0.08182241

Tree2 has more than four times the proportion of its leaves holed by caterpillars. If we had been paying more attention when we did the modelling the wrong way, we should have noticed that the model containing only Aphid and Caterpillar had massive overdispersion, and this should have alerted us that all was not well.

The moral is simple and clear. Always fit a saturated model first, containing all the variables of interest and all the interactions involving the nuisance variables (Tree in this case). Only delete from the model those interactions that involve the variables of interest (Aphid and Caterpillar in this case). Main effects are meaningless in contingency tables (they do nothing more than constrain the marginal totals), as are the model summaries.
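
The overdispersion in the model fitted without Tree can be checked with a little arithmetic. The sketch below uses only figures reported in the text: a residual deviance of 550.19 on 4 residual degrees of freedom for the model containing the Aphid by Caterpillar interaction, and a deviance change of 6.6594 on 1 degree of freedom when that interaction is deleted. The helper name dispersion_ratio is an assumption made for illustration; it is not a function from the text or from base R.

```r
# Ratio of residual deviance to residual degrees of freedom.
# For a well-behaved Poisson model this should be close to 1;
# values far above 1 suggest overdispersion.
# 'dispersion_ratio' is a hypothetical helper name, not part of base R.
dispersion_ratio <- function(resid_dev, resid_df) {
  stopifnot(resid_df > 0)
  resid_dev / resid_df
}

# Figures reported in the text for the model fitted without Tree
ratio_wrong <- dispersion_ratio(550.19, 4)
print(ratio_wrong)

# Chi-squared tail probability for the reported deviance change on 1 d.f.
p_wrong <- pchisq(6.6594, df = 1, lower.tail = FALSE)
print(p_wrong)

# For a fitted glm object m, the same ratio could be obtained with
# dispersion_ratio(deviance(m), df.residual(m))
```

A ratio of well over 100 is very far from 1, which is the warning sign described above: the apparently significant interaction comes from a model that fits the data badly.
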
Always test for overdispersion. It will never be a problem if you follow the advice of simplifying down from a saturated model, because you only ever leave out non-significant terms, and you never delete terms involving any of the nuisance variables.

15.7 Quasi-Poisson and negative binomial models compared

The data on red blood cell counts are read from a file:

    data [...]

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.18115    0.02167   8.360 [...]

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.18115    0.02160   8.388 [...]

    [...]

      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1         0 -5.329e-15
    2         1    3.08230 -1 -3.08230   0.07915

The interaction is not significant (p = 0.079), indicating similar gender by discipline relationships in the two year groups. We finish the analysis at this point because we have answered the question that we were asked to address.

15.9 Schoener's lizards: A complex contingency table

In this section we are interested in whether lizards show any niche separation across various ecological factors and, in particular, whether there are any interactions: for example, whether they show different habitat separati