
Educational videos for Biostatistics

Chapters 1 and 2: Frequency Distributions and Measures of Central Tendency

Cynthia Stewart

2 months ago

To keep us up to date, I wanted to quickly create a little video for you. I think I'll probably break this into two pieces, maybe do central tendency and then do variability, and then I'll upload these so that you can map the lecture content and what you're reading in the book onto our survey and the way that we're analyzing it. In this particular video I'm going to try to connect methods with some of the statistics that we use. We're going to talk about two kinds of statistics and then really focus our efforts on descriptive statistics, in particular measures of central tendency (where is the middle?) and then variability (how do the scores vary around the middle?).

So, very quickly, there are two basic kinds of research that we conduct. Exploratory research involves a topic that we know basically nothing about, and our goal is to gather some very rich, detailed information to help us describe this phenomenon, which no one knows about. Confirmatory research, on the other hand, is conducted when we know something about this particular phenomenon and we are trying to make some predictions or describe cause-and-effect relationships.

So today we're going to focus on descriptive statistics, which are a way that we are able to gather data and describe an abstract phenomenon. Sometimes we do that by just stating the number of people with the particular characteristic that we're interested in, or it might mean that we're looking at
group differences in people who all have this characteristic in common.

Inferential statistics are what we use with research where we know something about the phenomenon. With inferential statistics, I might be trying to predict the order of events, so maybe I want to predict that variable A precedes variable B, or I might be trying to actually say that variable A causes changes in variable B. Both of those would be inferential statistics, and we can only compute inferential statistics if we have interval or ratio data. Interval data is Likert scale data, because people are responding to these Likert scales, rating how true something is using a four-point scale like in our survey. Ratio data would be scale data, like the variable age in our survey. Ratio data has a true zero point. Interval data means that there are regular scalar intervals between each of these constructs, and it might be that that seems kind of fluid, but people really understand the difference, and different people will
use a scale the same way, so it is both valid and reliable across people and across time points. People get the difference between, you know, "completely false" versus "probably false." I understand it, you understand it, and we understand it in the same way.

Okay, so we're not going to practice; we're going to move right into measures of central tendency. Central tendency simply means: where is the center of a distribution? On this slide we can see that more people endorsed Junior than any other of these categories. That would then be the mode. Mode equals most: which category was most frequently endorsed? In this particular example we've got score on a quiz and the number of students who endorsed each score, and it looks like more students endorsed 39 than any other score, so the most students endorsed that they got a 39 on this particular quiz. Let's hope that's not out of a hundred.

There are some issues with the mode. It's sensitive to extreme scores. You can have a distribution that actually has two modes, where
people are endorsing two categories about the same amount, or it could be that everybody endorses the categories exactly the same, and so there is no mode.

Therefore, sometimes we use the median. The median means that you're going to order your scores from least to greatest and pick the one that's right in the middle. So if we were to look at this particular data, we would look and see, okay, where is the middle? For these scores, the smallest score is 34 and the greatest score is 40. We've got one, two, three, four, five, six, seven categories, so what is the score right in the middle? Well, it's 37, so 37 would be our median.

Okay, the last measure of central tendency that we'll talk about is the mean. So we've talked about the mode (most) and the median (middle); the mean is the average. Now, you'll see the average represented a couple of ways when you're reading scientific journal articles. If you see the mean represented as an X-bar, that means that it is the average for a sample of data. If you see
this crazy M called mu (μ), that means that it's the average for a population. So let's talk about the difference between population and sample really quick, just in case you don't remember. A population is all the people who have some characteristic in common that I'm interested in studying. So let's say all UHD college students who have consumed alcohol in the last year. All right, so we have maybe 15,000 students at UHD. Those that meet the criteria to be in my population would have had to have a
sip or more of alcohol in the last 365 days. But that might be a lot of people for me to try to assess, so I might want to randomly select a subset of those people. That subset of people who attend UHD and have consumed alcohol in the last 365 days would be my sample.

So: sample, X-bar; population, mu. They're computed the same way. You're going to add up the scores and divide by the number of people in your sample. This crazy kind of E is called sigma (Σ). It represents this notion of summing something, so you're going to add things up. The X in the numerator represents a score, so it's the sum of scores, and then it's divided by N, the number of people either in your sample or your population. So you add up all the scores and you divide by the number of people in the sample or population.

Okay, so here are some disadvantages related to the mean. It can be very sensitive to extreme scores. If you have some scores that are outliers hanging out at the far end of a frequency distribution, they can make the mean look either larger or smaller than it really should be, because the mean is influenced by those extreme scores. You also may just not have a very interesting-looking distribution. Maybe everybody has endorsed each of these response options equally, and so the mean is sensitive to a distribution that really doesn't have a more popularly endorsed category.

So how do we go about calculating the mean? I'm going to talk about that actually on the next slide, but we can take a look at
it here. We're going to sum the scores and divide by the number of scores to calculate a mean, so first I want to know the number of people. Say I have a sample of students, and they had five different religion options to place themselves into. It looks like zero students endorsed Buddhist, two students endorsed Protestant, five students endorsed Catholic, four students endorsed Jewish, and one student endorsed Muslim. To find the total number of people in the sample, I would add up how many people endorsed each category. So two for Protestant plus five for Catholic: two plus five is seven. Plus four people endorsed Jewish, so seven plus four is eleven. And then we had one person who endorsed Muslim, so eleven plus one is twelve. So we know there were a total of 12 people in this sample.

Now, how do I sum these scores? Well, it looks like I had zero people who endorsed Buddhist, so that's a zero. Then
I had two people who endorsed Protestant, so that's two times this category number, which might be one. Two times one is two. Then we've got Catholic, which might have been a three. This third category, Catholic, five people endorsed it, so three times five is fifteen. So we have two plus fifteen, which would be 17. Now, Jewish is the fourth category, and there were four people who endorsed it, so 4 times 4 is 16. So we have 17 plus 16. I can't do that in my head, sorry. And then we would add that to this last category. This fifth category is five, and one person endorsed it, so it would be five times one, which would be five. So seventeen plus sixteen plus five equals the sum of the scores, and that is divided by the total number of participants, which we said was what? 12. That's how you find the mean.

This is another example about finding the mean. Here are some great things about the mean. Let me introduce you to a symmetrical distribution.
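
The frequency-table arithmetic for the religion example can be sketched in a few lines of Python. This is only an illustration, not part of the recording: the variable names are made up here, and the category codes follow the numbers as spoken (the code for Protestant is unclear in the recording, so treat it as an assumption). Because these codes are arbitrary labels for nominal categories, the result illustrates the arithmetic only.

```python
# Frequency table: category code -> number of students who endorsed it.
# Codes are an assumption taken from the spoken numbers; Buddhist had zero
# endorsements, so it adds nothing to either total and is left out.
counts_by_code = {
    1: 2,  # Protestant: 2 students (code uncertain in the recording)
    3: 5,  # Catholic: 5 students
    4: 4,  # Jewish: 4 students
    5: 1,  # Muslim: 1 student
}

# N: total number of people in the sample.
n = sum(counts_by_code.values())

# Sum of scores: each code multiplied by how many people endorsed it.
sum_of_scores = sum(code * count for code, count in counts_by_code.items())

# Mean: sum of scores divided by N.
mean = sum_of_scores / n

print(n)               # 12
print(sum_of_scores)   # 38
print(round(mean, 2))  # 3.17
```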
A symmetrical distribution means that if you drew a distribution on a sheet of paper and you folded that paper right in the center, the right side of the distribution would look exactly like the left side. It means that as you move away from the mean, there are equal numbers of people endorsing the categories on each side, and fewer and fewer people on each side are endorsing the more extreme categories. If we look at this example, hours of sleep, the fewest number of people endorsed that they got three and a half hours, and similarly the fewest number of people endorsed that they got 11 and a half hours. The most people endorsed that they got seven and a half hours, so this would be the mode, the most commonly endorsed. It's also right in the middle of the distribution, so it is also the median. So seven and a half would end up being the mean, the median, and the mode. That's something nice about a symmetrical distribution.

How often do we have symmetrical distributions? Well, we have them a lot when we look at population-level data, because most people are kind of average. They tend to endorse things in the middle, and there are very few people who are kind of weirdos on the extremes. Not that that always works when we're looking at categories, but it works nicely when we're talking about frequency distributions with ratio data, like number of hours slept. That's
ratio data; there's a true zero point. So there are going to be very few people who got zero hours of sleep, and there are going to be very, very few people who got 24 hours of sleep, and then there are going to be more and more people who are getting somewhere in the middle.

Okay, so let's talk about asymmetrical distributions. If you look at this distribution in yellow on the left-hand side of the screen, this is a positively skewed distribution. Why do we say it's positively
skewed? Well, we've got this hunk of extreme scores out here on the end. There's a group of people who have endorsed these very large numbers for income in the thousands. It looks like there are five people whose families made 240, maybe six and a half people whose families made 220,000 or more, maybe seven-ish people who made 200. Those few extreme scores on the high end are going to skew our data and make it look like the mean is actually larger than it should be, because it's pulled by these very large numbers in the extremities on the positive end of this distribution.

If you look at this light green distribution, this is a negatively skewed distribution, for similar reasons. You have a few people who have these very low scores, and so the mean is going to look smaller than it probably should, because we've got these outliers who made very low scores while the majority of students were endorsing something around 80 to 100.

So in a positively skewed distribution, our median, whatever is in the middle, is probably going to be a better way of assessing where the middle of that distribution is. The mode is going to be determined by this tallest bar, but it may not really show us where the center is. And the mean is perhaps not going to be the best measure of the center in a positively skewed distribution, because it is pulled toward a higher number by these extreme scores. The same, of course, is true in the other direction with this negatively skewed distribution. The median in a skewed distribution is probably the best measure of central tendency, but we use the mean a lot in statistics.

Okay, so variability is simply the spread of scores around the mean. In this particular example, the smallest score looks like it's a 10 and the largest score is a 90, so 90 minus 10 would be 80. The range is 80. The other way that we can conceptualize the dispersion of scores around a mean is by looking at how each score deviates from the mean, and the deviation of scores is kind of the starting point of any hand calculation that you would perform for descriptive or inferential statistics.

Okay, so I've got one, two, three, four, five, six, seven, eight, nine. I've got nine students who are a sample from a classroom of, let's say, 30. So my sample size is nine, and it looks like the scores on this test ranged from 10 to 90, so the range is 80: 90 minus 10. And if we wanted to figure out how much Theo's score of 20 deviated from the mean, we would need to compute a deviation score for Theo. So how do we do that? Well, the first thing that we do is find the average for these test scores. I would add 10 plus 20 plus 30 plus 40 plus 50 plus 60 plus 70 plus 80 plus 90, and I would divide that by the number of scores, n. That gives 450 divided by 9, which is 50. Then I would find the deviation: take the average and subtract it from each individual's score. So 10 minus 50 is negative 40.
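
As a rough sketch of the hand calculation just described, the range, the mean, and the deviation scores for the nine test scores can be computed in Python. The variable names are illustrative only and are not from the recording.

```python
# The nine test scores read out in the example, from lowest to highest.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Range: largest score minus smallest score.
score_range = max(scores) - min(scores)

# Mean: sum of the scores divided by the number of scores (n).
mean = sum(scores) / len(scores)

# Deviation score: each individual score minus the mean.
deviations = [x - mean for x in scores]

print(score_range)    # 80
print(mean)           # 50.0
print(deviations[0])  # -40.0 (the score of 10)
print(deviations[1])  # -30.0 (the score of 20)
```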
So Amy is 40 points below the mean. Theo is a 20, so 20 minus 50 is negative 30. Theo is 30 points below the mean. Henry got a 40, so 40 minus the average of 50 is negative 10. He is 10 points below the mean, and so on. When we're calculating deviation scores, we do that for every single person in our sample.

After we compute a deviation score for every single student, we're going to add them up. Hey, what just happened? Well, if I want information about how the scores vary around the mean, I've got a problem here. What's happened is that in a symmetrical distribution, if I add up the deviation scores, they sum to zero. See, Amy is 40 points below the mean, but then if we look at Lulu, hey, she's 40 points above, so 40 minus 40 is zero. Let's look at Theo. Theo is 30 points below the mean, but Trisha down here is 30 points above the mean, so 30 minus 30 is zero. Uh-oh. We've got Max, who's 20 points below the mean, but then we've got Pedro, who ends up being 20 points above the mean. 20 minus 20 is zero. Dang it. So in this symmetrical distribution, if we just sum up all of these deviation scores, we have basically zero information about how these scores vary around the mean.

So we perform a statistical manipulation so that we understand how these scores are dispersed around the mean. What we do is square each of these deviation scores. We started off by adding up all the scores and dividing by the number of people, and we took that mean and subtracted it from each of these scores. Remember, we took Amy's score of 10, subtracted the mean of 50 from it, and found that her deviation score was negative 40; she's 40 points below the mean. But in order for us to be able to add up these deviation scores and have them not sum to zero, we perform this manipulation where we square each of the deviation scores. So
for negative 40, negative 40 times negative 40 is 1,600. Remember, when you square a number you simply multiply that number by itself. Theo was negative 30, so I would say negative 30 times negative 30; his squared deviation score is 900. And so on for the rest of the participants. Then we add up these squared deviations: 1,600 plus 900 plus 400 plus 100 plus 0 plus 100 plus 400 plus 900 plus 1,600. If I add up all of these squared deviations, the sum of squares is 6,000.

Now wait, wait, wait. 6,000 makes no sense. How could our participants vary that much around the mean if the range of scores is somewhere between zero and a hundred? There's no way you could be 6,000 points away when you can't make more than a hundred. Well, we did something here: we squared all of these deviation scores so that they would sum to a non-zero number.

So after we find the sum of squares, to find the variance we are going to take the sum of squares and divide it by our sample size. If you are looking at population-level data, you are going to calculate the sum of squares and divide that by the number of people in the population. However, if you're finding the variance for a sample, a subset of people, you are going to take that sum of squares and divide it by n minus 1.

Well, why would we do that? Think about what it means to divide a fraction. If you have a very large number in the numerator and a small number in the denominator, when you solve it, it's going to be large. So let's take, for example, a sum of squares of 10, and we have two people. For a population, 10 divided by two is five, so our variance would end up being 5. But if I have a sample, my sum of squares is 10, and then 2 minus 1 is 1. So when I take 10 and divide it by one, oops, it's ten. It looks inflated, right? Because of the manipulation that we've performed, we are building in greater variability, because inherently
when you have a smaller number of people in a sample than what you have in a giant population, they are going to be less representative of all of the diversity that exists in a population of people. If I have a thousand people, there are going to be all kinds of differences in a thousand people that are not represented in a group of a hundred. So whenever we're finding the variance for a sample, we try to boost the amount of variance by subtracting one from the number of people in the sample, so that the
variability that exists in the numerator, the sum of squares, has the opportunity to demonstrate the diversity that would exist in a population. Okay, so that's why.

So to find the variance, you take the sum of squares. If it's a population, you divide by N, the total number of people in the population. If you've got sample data, like what I had (I had nine people), it would be the sum of squares that we found, 6,000, divided by n minus one, which is eight.

So this is an example of how I found the
variance. We had the scores, we found the mean, then we subtracted the mean from each score, and then we squared those deviation scores to get these squared deviations. Then we added them up to find the sum of squares, 6,000, and then we took the sum of squares and divided it by n minus 1, which is 8. The variance is 750. Well, that's better than 6,000, but our test ranged from 0 to 100. You couldn't get more than 100 on this test, so 750 is still whack-a-doodle. It's still too much variation. It doesn't
even fit within the parameters of the test, the scores that you can get on the test. So we need to do one more thing to correct for the manipulation that we performed by squaring these deviations. What am I going to do? That will be our standard deviation. I don't have my phone or my calculator with me, but you can do that. Okay, so we're going to stop here for right now, and I will pick up with the next recording.
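
For anyone following along without a calculator, here is a minimal Python sketch of the sum of squares and the variance for the nine test scores. It also takes the square root of the sample variance, which is the usual way to get from a variance to a standard deviation; the recording stops before that step, so treat that last line as an assumption about what comes next. The variable names are illustrative.

```python
import math
import statistics

# The nine test scores from the example.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]
n = len(scores)
mean = sum(scores) / n

# Sum of squares: square each deviation from the mean, then add them up.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

# Population variance divides by N; sample variance divides by n - 1.
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# Standard deviation: square root of the variance, which puts the
# result back on the original 0-to-100 score scale.
sample_sd = math.sqrt(sample_variance)

print(sum_of_squares)       # 6000.0
print(sample_variance)      # 750.0
print(round(sample_sd, 2))  # 27.39

# Cross-check against the standard library's sample variance.
assert math.isclose(statistics.variance(scores), sample_variance)
```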
