Averages

Oct. 3rd, 2005 05:52 pm
[personal profile] yrieithydd
Mmmm, reading about statistics has reminded me that saying `average' is not technically precise, as there are (at least?) three measures of the average in statistics. Thus I decided that I ought to avoid using the word and explain why to my non-technical audience. In doing this, I found myself confused about what the thing I was referring to as `the average' actually was.

I think I'm being confused because I'm dealing with frequencies. The data that I am analysing are the figures for occurrences of my preposition(s) across the texts of my corpus. My thirteen texts are very different lengths, ranging from 1967 words to 66608 words, though those are extremes: the second longest is 15804 words and the second shortest 3079 words. I have produced the frequencies by dividing the number of occurrences by the number of words and multiplying by 1000, which gives the number of occurrences per 1000 words. I have done this for each of the texts in turn and for the corpus as a whole. It is this last figure that I have described as the `average for the corpus'. This obviously is not the median or the mode (neither of which makes any sense for my data AFAICS) but is it the mean? It was not found by adding up a set of figures and dividing the total by the number of examples. Am I right to just call it the average?

Date: 2005-10-03 05:44 pm (UTC)
From: [identity profile] atreic.livejournal.com
It's the mean

Date: 2005-10-03 05:47 pm (UTC)
From: [identity profile] atreic.livejournal.com
Imagine the occurrence of the preposition is a pink cow, and 1000 words is a field of 1000 cows. You're trying to find the "average" number of pink cows in each field. You have taken the total number of pink cows (the total of a set of figures) and divided it by the number of fields. What is confusing is that you have multiplied by 1000 / number of (words==all cows), rather than clearly dividing by the number of fields, but number of cows / 1000 cows in a field *is* the number of fields, so multiplying by 1000 / number of cows is completely equivalent to dividing by the number of fields.

Date: 2005-10-03 06:11 pm (UTC)
From: [identity profile] yrieithydd.livejournal.com
But that analogy lacks the distinction of the texts. Surely, the fields are the texts with a varying number of cows in them, some of which are pink (my preposition). I have added up the number of pink cows and the total number of cows(/1000) and worked it out. Intuitively, I thought it was the mean, but then got confused when trying to explain the concept of mean (generally) as to how it applied to my actual figures. But I'm not sure I can explain my confusion!

Date: 2005-10-03 08:35 pm (UTC)
From: [personal profile] mair_in_grenderich
the pink cows confused me, although I did go and look in google for pictures of pink cows :-)

Personally, I'd just write down a list of equations and cancel things to see what I had, and I can't work out what it *is* you're asking about.

Is it just (NumPreps/NumWords) * 1000? Or are you summing over the results from each of the separate documents?
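The two readings in that question can give different numbers when the documents differ in length. A minimal sketch in Python, assuming made-up counts for two documents (the figures and the names `docs`, `pooled_rate` and `mean_of_rates` are hypothetical, not taken from the corpus):

```python
# Hypothetical (num_words, num_preps) pairs for two documents of unequal length.
docs = [(2000, 10), (500, 5)]

def per_thousand(num_preps, num_words):
    """Occurrences per 1000 words."""
    return num_preps / num_words * 1000

# Reading 1: (NumPreps / NumWords) * 1000 over the whole corpus.
total_words = sum(words for words, _ in docs)
total_preps = sum(preps for _, preps in docs)
pooled_rate = per_thousand(total_preps, total_words)

# Reading 2: work out the rate for each document, then average those rates.
rates = [per_thousand(preps, words) for words, preps in docs]
mean_of_rates = sum(rates) / len(rates)

print(pooled_rate, rates, mean_of_rates)
```

With these toy figures the pooled rate sits closer to the longer document's rate, because every word counts equally; the mean of the per-document rates gives each document equal weight regardless of length.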

Date: 2005-10-04 10:51 am (UTC)
From: [identity profile] atreic.livejournal.com
The texts aren't the distinction you're making. Imagine the texts are "farmers who own cows", who just happen to all put their cows in fields of a thousand cows because it's easy. You *could* find the mean number of pink cows owned by a farmer, by adding up all the pink cows and dividing them by the number of farmers (texts) but you *haven't* - you've found the mean number of pink cows in a field, and whose cows they were hasn't been considered. "Mean no of pink cows in a field" is *a* mean, that says something about pink cows, not *the* mean - you could have mean number of pink cows per farmer, or mean number of pink cows per milking shed, or...

Date: 2005-10-04 10:52 am (UTC)
From: [identity profile] atreic.livejournal.com
oh, err, what Jack said. Far more concisely and clearly :)

Date: 2005-10-04 12:23 pm (UTC)
From: [identity profile] cartesiandaemon.livejournal.com
Oh wow, thank you. Normally *I* say that to *you*! :) Though I'm not sure my explanation was particularly clear.

OK, the word I couldn't think of before is *density*. Again, I am without technical vocabulary, but you're measuring the percentage of prepositions, or how concentrated the preps are, not how many there are in total.

There is an arbitrary scaling, because we're saying "X per 1000", instead of "X/10 %" or "X/100 words prefer prepositions" or "If the works were the same length..." all of which are interchangeable, and we picked an easy one to describe[1].

[1] Imagine a poor linguistics professor reading this and muttering querulously "0.0214 of a preposition? What?" :)

Date: 2005-10-05 09:59 am (UTC)
From: [identity profile] yrieithydd.livejournal.com
[1] Imagine a poor linguistics professor reading this and muttering querulously "0.0214 of a preposition? What?" :)

This is why I decided to go for per 1000 words, that way we get some whole numbers for our answers! And saying that 0.0004 of each word is a preposition makes no sense. Saying that in 1000 words, you'd expect 9.34 of them or even 0.003 of them to be prepositions is much more sensible.

Date: 2005-10-05 10:03 am (UTC)
From: [identity profile] yrieithydd.livejournal.com
Aah, that begins to make sense. I could say that the mean per farmer was 981/13, but that would make little sense: the farmers own such vastly different numbers of cows that it is not a helpful figure. Hence wanting it per field. This figure can then be compared with the figures for each farmer!

Date: 2005-10-05 10:25 am (UTC)
From: [identity profile] atreic.livejournal.com
Yes, almost.

Concrete example:
fields contain 1000 cows
Farmer Abbot owns 5000 cows, of which 500 are pink
Farmer Bloggs owns 500 cows, of which 250 are pink*
Farmer Cooke owns 500 cows, of which 50 are pink

So we have three farmers, and six fields

The mean number of cows per farmer is 2000
The mean number of pink cows per farmer is 266 (2/3)
The mean number of cows per field is 1000 (by definition of a field :) )
The mean number of pink cows per field is 133 (1/3)

but what you want to do is compare not the mean number of pink cows per farmer with the number in a field, but the mean number of pink cows that belong to farmer A/B/C in a field with each other, and with the mean number of all pink cows in a field.

So Farmer Abbot owns 5000 cows, = 5 fields,
So the mean number of Farmer Abbot's pink cows in a field is 500/5 = 100
Farmer Bloggs only owns 1/2 a field of cows
The mean number of Farmer Bloggs's pink cows in a field is 250/(1/2) = 500
The mean number of Farmer Cooke's pink cows in a field is 50/(1/2) = 100

So you can conclude that
a) Farmers Cooke and Bloggs have fewer cows in total than on average (they are shorter texts)
b) *important thing not to get confused about* - it looks from finding the mean per farmer that Farmer Bloggs has fewer pink cows than the mean number of pink cows per farmer. This is True, but it's not The Point - Farmer Bloggs has far fewer cows than Farmer Abbot, and so has far fewer pink cows as a consequence of this!
c) Looking at the mean number of pink cows per field of farmer A/B/C (ie looking at the mean number of occurrences per thousand words in a given text) lets you see that Farmers Abbot and Cooke have exactly the same frequency of pink cows among their cows. It also shows you that Farmer Bloggs has a huge proportion of pink cows!

Looking at things per field (per thousand words) is just the same as looking at them per cow (per word) with the exception of a scaling factor, and the fact it gives you fewer decimal places :)

*That would be a Very Strange Text!
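
The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the farmer figures are the ones given above, while the variable and function names are made up for illustration. Exact fractions are used so that the thirds do not get rounded.

```python
from fractions import Fraction

FIELD_SIZE = 1000  # cows per field, i.e. the "per 1000 words" unit

# farmer -> (total cows, pink cows), i.e. text -> (words, occurrences)
farmers = {
    "Abbot": (5000, 500),
    "Bloggs": (500, 250),
    "Cooke": (500, 50),
}

def pink_per_field(total, pink):
    """Pink cows per field of FIELD_SIZE cows (occurrences per 1000 words)."""
    return Fraction(pink * FIELD_SIZE, total)

total_cows = sum(total for total, _ in farmers.values())
total_pink = sum(pink for _, pink in farmers.values())
n_fields = Fraction(total_cows, FIELD_SIZE)

# Means taken "per farmer" (per text)
mean_cows_per_farmer = Fraction(total_cows, len(farmers))
mean_pink_per_farmer = Fraction(total_pink, len(farmers))

# Mean taken "per field" (per 1000 words) over all the cows
mean_pink_per_field = total_pink / n_fields

# The same per-field figure for each farmer separately
per_farmer_rates = {
    name: pink_per_field(total, pink)
    for name, (total, pink) in farmers.items()
}

print("fields:", n_fields)
print("mean pink cows per farmer:", mean_pink_per_farmer)
print("mean pink cows per field:", mean_pink_per_field)
for name, rate in per_farmer_rates.items():
    print(name, "pink cows per field:", rate)
```

The per-farmer rates are what get compared with the overall per-field mean; the per-farmer means mostly reflect how many cows each farmer owns.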



Date: 2005-10-03 08:37 pm (UTC)
From: [personal profile] emperor
It's the mean for the corpus. Because if you took this value, divided it by 1000 (number of occurrences per word-of-corpus), and multiplied it by the number of words in the corpus, you'd get the number of times the preposition occurred across the entire corpus.
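
Written out as a rough derivation, with notation that is assumed here rather than taken from the thread: text $i$ has $n_i$ occurrences in $w_i$ words, $N = \sum_i n_i$ and $W = \sum_i w_i$ are the corpus totals, $f_i$ is the per-1000 figure for text $i$ and $f$ the figure for the whole corpus.

```latex
\begin{align*}
f_i &= 1000\,\frac{n_i}{w_i}, \qquad f = 1000\,\frac{N}{W} \\[4pt]
\frac{f}{1000}\,W &= N
  && \text{(the corpus figure gives back the total count)} \\[4pt]
f &= 1000\,\frac{\sum_i n_i}{W}
   = \sum_i \frac{w_i}{W}\cdot 1000\,\frac{n_i}{w_i}
   = \sum_i \frac{w_i}{W}\, f_i
  && \text{(a mean of the $f_i$ weighted by text length)}
\end{align*}
```

So the corpus figure is a mean, but one in which each text counts in proportion to its length; it equals the plain mean of the per-text figures only when all the texts are the same length (or all share the same rate).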

Date: 2005-10-04 08:55 am (UTC)
From: [identity profile] cartesiandaemon.livejournal.com
Oops, I'm not sure now. Definitely mean, but could be slightly confusing devoid of context, because it's the mean per 1000 words, rather than e.g. mean number of words per book.

Average actually isn't so bad, because it can mean 'the average we're using now, ok?' :)

Date: 2005-10-05 10:01 am (UTC)
From: [identity profile] yrieithydd.livejournal.com
Average actually isn't so bad, because it can mean 'the average we're using now, ok?' :)

So I can get away with using it and not have to explain different sorts of averages to my non-technical audience?

Date: 2005-10-05 11:33 am (UTC)
From: [identity profile] cartesiandaemon.livejournal.com
Probably. People might not necessarily get what's really going on (see our confusion in this thread :)) but probably get the idea that it's "how much preps are used", though you may get some people going "Wait, what's this an average of? I'm confused now..."

Actually, maybe you were right in the post, and should say 'frequency'.
