Mmmm, reading about statistics has reminded me that saying `average' is not technically precise, as there are (at least?) three measures of the average in statistics. Thus I decided that I ought to avoid using the word and explain why to my non-technical audience. In doing this, I found myself confused about what the thing I was referring to as `the average' actually was.
I think I'm being confused because I'm dealing with frequencies. The data that I am analysing are the figures for occurrences of my preposition(s) across the texts of my corpus. My thirteen texts are very different lengths, ranging from 1967 words to 66608 words, though those are extremes: the second longest is 15804 words and the second shortest 3079 words. I have produced the frequencies by dividing the number of occurrences by the number of words and multiplying by 1000, which gives the number of occurrences per 1000 words. I have done this for each of the texts in turn and for the corpus as a whole. It is this figure for the whole corpus that I have described as the `average for the corpus'. This obviously is not the median or the mode (neither of which makes any sense for my data AFAICS), but is it the mean? It was not found by adding up a set of figures and dividing the total by the number of figures. Am I right to just call it the average?
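To make the two candidate calculations concrete, here is a minimal Python sketch. The four text lengths are the ones quoted above, but the occurrence counts are invented purely for illustration, and the names `texts` and `per_thousand` are hypothetical. It computes the per-1000-word figure for the pooled corpus and, for comparison, the plain mean of the per-text figures.

```python
# (words, occurrences) per text. The lengths come from the post;
# the occurrence counts are made up for illustration only.
texts = [
    (1967, 12),
    (3079, 25),
    (15804, 140),
    (66608, 400),
]


def per_thousand(occurrences, words):
    """Occurrences per 1000 words."""
    return occurrences * 1000 / words


# Frequency for each text in turn.
per_text = [per_thousand(occ, words) for words, occ in texts]

# Frequency for the corpus as a whole: pool the counts first, then divide.
total_words = sum(words for words, _ in texts)
total_occ = sum(occ for _, occ in texts)
pooled = per_thousand(total_occ, total_words)

# Plain (unweighted) mean of the per-text frequencies.
mean_of_rates = sum(per_text) / len(per_text)

print(f"per text:      {[round(r, 2) for r in per_text]}")
print(f"pooled corpus: {pooled:.2f}")
print(f"mean of rates: {mean_of_rates:.2f}")
```

With texts of such different lengths the two figures generally disagree, because the pooled figure lets the long texts count for more.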
no subject
Date: 2005-10-03 05:44 pm (UTC)
no subject
Date: 2005-10-03 05:47 pm (UTC)
no subject
Date: 2005-10-03 06:11 pm (UTC)
no subject
Date: 2005-10-03 08:35 pm (UTC)
Personally, I'd just write down a list of equations and cancel things to see what I had, and I can't work out what it *is* you're asking about.
Is it just (NumPreps/NumWords) * 1000? Or are you summing over the results from each of the separate documents?
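Writing the equations down might look like the sketch below. The symbols are labels introduced here, not taken from the post: $o_i$ is the number of occurrences in text $i$, $n_i$ its length in words, $f_i$ its per-1000-word figure, and $N$ the total number of words in the corpus.

```latex
\begin{align*}
f_i &= 1000 \cdot \frac{o_i}{n_i}, \qquad N = \sum_i n_i \\
f_{\text{corpus}} &= 1000 \cdot \frac{\sum_i o_i}{\sum_i n_i}
  = \sum_i \frac{n_i}{N} \cdot 1000 \cdot \frac{o_i}{n_i}
  = \sum_i \frac{n_i}{N} \, f_i
\end{align*}
```

So, under these definitions, the whole-corpus figure is a mean of the per-text figures, but one weighted by text length rather than the plain mean. The two coincide when all the texts are the same length.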
no subject
Date: 2005-10-04 10:51 am (UTC)
no subject
Date: 2005-10-04 10:52 am (UTC)
no subject
Date: 2005-10-04 12:23 pm (UTC)
OK, the word I couldn't think of before is *density*. Again, I am without technical vocabulary, but you're measuring the proportion of prepositions, or how concentrated the preps are, not how many there are in total.
There is an arbitrary scaling, because we're saying "X per 1000" instead of "X/10 %" or "X/100 words prefer prepositions" or "If the works were the same length...", all of which are interchangeable, and we picked an easy one to describe[1].
[1] Imagine a poor linguistics professor reading this and muttering querulously "0.0214 of a preposition? What?" :)
no subject
Date: 2005-10-05 09:59 am (UTC)
This is why I decided to go for per 1000 words, that way we get some whole numbers for our answers! And saying that 0.0004 of each word is a preposition makes no sense. Saying that in 1000 words you'd expect 9.34 of them, or even 0.003 of them, to be prepositions is much more sensible.
no subject
Date: 2005-10-05 10:03 am (UTC)
no subject
Date: 2005-10-05 10:25 am (UTC)
Concrete example:
Each field contains 1000 cows
Farmer Abbot owns 5000 cows, of which 500 are pink
Farmer Bloggs owns 500 cows, of which 250 are pink*
Farmer Cooke owns 500 cows, of which 50 are pink
So we have three farmers, and six fields
The mean number of cows per farmer is 2000
The mean number of pink cows per farmer is 266 (2/3)
The mean number of cows per field is 1000 (by definition of a field :) )
The mean number of pink cows per field is 133 (1/3)
but what you want to do is compare not the mean number of pink cows per farmer with the number in a field, but the mean number of pink cows per field for each of farmers A/B/C, both with each other and with the mean number of all pink cows in a field.
So farmer Abbot owns 5000 cows = 5 fields,
So the mean number of farmer Abbot's pink cows in a field is 500/5 = 100
Farmer Bloggs only owns 1/2 a field of cows
The mean number of farmer Bloggs's pink cows in a field is 250/(1/2) = 500
The mean number of farmer Cooke's pink cows in a field is 50/(1/2) = 100
So you can conclude that
a) farmers Cooke and Bloggs have fewer cows in total than the mean per farmer (they are shorter texts)
b) *important thing not to get confused about* - it looks from finding the mean per farmer that farmer Bloggs has fewer pink cows than the mean number of pink cows per farmer. This is True, but it's not The Point - farmer Bloggs has far fewer cows than farmer Abbot, and so has far fewer pink cows as a consequence of this!
c) Looking at the mean number of pink cows per field of farmer A/B/C (ie looking at the mean number of occurrences per thousand words in a given text) lets you see that farmers Abbot and Cooke have exactly the same frequency of pink cows among their cows. It also shows you that farmer Bloggs has a huge proportion of pink cows!
Looking at things per field (per thousand words) is just the same as looking at them per cow (per word) with the exception of a scaling factor, and the fact it gives you fewer decimal places :)
*That would be a Very Strange Text!
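The cow arithmetic above can be checked with a short Python sketch. The figures are the ones from the example; the variable names (`FIELD_SIZE`, `farmers`, `pink_per_field`) are just labels chosen here.

```python
FIELD_SIZE = 1000  # cows per field, i.e. "per thousand words"

# farmer -> (total cows, pink cows), as given in the example
farmers = {
    "Abbot": (5000, 500),
    "Bloggs": (500, 250),
    "Cooke": (500, 50),
}

total_cows = sum(cows for cows, _ in farmers.values())
total_pink = sum(pink for _, pink in farmers.values())
fields = total_cows / FIELD_SIZE

# Means per farmer: these mostly reflect herd size (text length).
mean_cows_per_farmer = total_cows / len(farmers)
mean_pink_per_farmer = total_pink / len(farmers)

# Pink cows per field over all cows (the whole-corpus frequency).
mean_pink_per_field = total_pink / fields

# Pink cows per field for each farmer (the per-text frequency).
pink_per_field = {
    name: pink * FIELD_SIZE / cows for name, (cows, pink) in farmers.items()
}

print(f"fields: {fields:.0f}")
print(f"mean cows per farmer: {mean_cows_per_farmer:.0f}")
print(f"mean pink cows per farmer: {mean_pink_per_farmer:.2f}")
print(f"mean pink cows per field: {mean_pink_per_field:.2f}")
for name, rate in pink_per_field.items():
    print(f"{name}: {rate:.0f} pink cows per field")
```

Only the per-field figures put the three farmers on a scale where herd size no longer matters, which is the comparison the example is arguing for.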
no subject
Date: 2005-10-03 08:37 pm (UTC)
no subject
Date: 2005-10-04 08:55 am (UTC)
Average actually isn't so bad, because it can mean 'the average we're using now, ok?' :)
no subject
Date: 2005-10-05 10:01 am (UTC)
So I can get away with using it and not have to explain different sorts of averages to my non-technical audience?
no subject
Date: 2005-10-05 11:33 am (UTC)
Actually, maybe you were right in the post, and should say 'frequency'.