Zhejiang University, Xiao Zhonghua: Corpus Linguistics, Session 5 (Foreign Language Learning)

Making statistical claims
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com

Update on assignments
Deadline for submission (email submission): TBA
The Harvard referencing style
Assignment A
  Corpus study: introduction; synopsis / overview, critical review of data, method of analysis, conclusion etc; conclusions; bibliography
    CL2005: http://www.corpus.bham.ac.uk/pclc/index.shtml
    CL2007: http://www.corpus.bham.ac.uk/conference/proceedings.shtml
    UCCTS2008: http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2008Proceedings/
    UCCTS2010: http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/
  Corpus tool: introduction; description of the tool, its main features and functions; your critical evaluation of the tool (how well it does the jobs it is supposed to do; user interface, power, etc); conclusions; bibliography
Assignment B
  Introduction; literature review; methodology; results and discussion; conclusions; bibliography
Option B: a 3,500-word essay, similar to Assignment B

Outline of the session
Lecture
  Raw and normalised frequency
  Descriptive statistics (mean, mode, median, measures of dispersion)
  Inferential statistics (chi-squared, log-likelihood (LL), Fisher's exact test)
  Collocation statistics
Lab
  UCREL online LL calculator
  Xu's LL calculator
  SPSS

Quantitative analysis
Corpus analysis is both qualitative and quantitative
One of the advantages of corpora is that they can readily provide quantitative data which intuitions cannot provide reliably
“The use of quantification in corpus linguistics typically goes well beyond simple counting” (McEnery and Wilson 2001: 81)
What can we do with those numbers and counts?

Raw frequency
The arithmetic count of the occurrences of a linguistic feature (a word, a structure etc)
The most direct quantitative data provided by a corpus
Frequency itself does NOT tell you much in terms of the validity of a hypothesis
There are 250 instances of the f*k swearword in the spoken BNC, so what?
Does this mean that people swear frequently or infrequently when they speak?

Normalized frequency
In relation to what?
Corpus analysis is inherently comparative
There are 250 instances of the swearword in the spoken BNC and 500 instances in the written BNC
  Do people swear twice as often in writing as in speech?
  Remember the written BNC is 9 times as large as the spoken BNC
When comparing corpora of different sizes, we need to normalize the frequencies to a common base (e.g. per million tokens)
  Normalised freq = raw freq / token number * common base
The swearword is about 4 times as frequent in speech as in writing (token numbers and common base both in millions of tokens)
  Swearword in spoken BNC = 250 / 10 * 1 = 25 per million tokens
  Swearword in written BNC = 500 / 90 * 1 ≈ 6 per million tokens
  But is this difference statistically significant?

Normalized frequency
The size of a sample may affect the level of statistical significance
Tips for normalizing frequency data
  The common base for normalization must be comparable to the sizes of the corpora
  Normalizing the spoken vs. written BNC to a common base of 1000 tokens?
Warning
  Results obtained on an irrationally enlarged or reduced common base are distorted

Descriptive statistics
Frequencies are a type of descriptive statistics
Descriptive statistics are used to describe a dataset
A group of ten students took a test and their scores are as follows
  4, 5, 6, 6, 7, 7, 7, 9, 9, 10
How will you report the measure of central tendency of this group of test results using a single score?

The mean
The mean is the arithmetic average
The most common measure of central tendency
Can be calculated by adding all of the scores together and then dividing the sum by the number of scores
  4+5+6+6+7+7+7+9+9+10 = 70; 70 / 10 = 7 (i.e. the mean is 7)
While the mean is a useful measure, unless we also know how dispersed (i.e. spread out) the scores in a dataset are, the mean can be an uncertain guide

The mode and the median
The mode is the most common score in a set of scores
The mode in our testing example is 7, because this score occurs more frequently than any other score
  4, 5, 6, 6, 7, 7, 7, 9, 9, 10
The median is the middle score of a set of scores ordered from the lowest to the highest
  For an odd number of scores, the median is the central score in an ordered list
  For an even number of scores, the median is the average of the two central scores
  In the above example the median is 7 (i.e. (7+7)/2)

Measure of dispersion: range
The range is a simple way to measure the dispersion of a set of data
The difference between the highest and lowest frequencies / scores
In our testing example the range is 6 (i.e. highest 10 - lowest 4)
Only a poor measure of dispersion
  An unusually high or low score in a dataset may make the range unreasonably large, thus giving a distorted picture of the dataset

Measure of dispersion: variance
The variance measures the distance of each score in the dataset from the me