Summary: Statistics: The Art and Science of Learning from Data, Third Edition, Alan Agresti & Christine Franklin

1 Statistics: the art and science of learning from data

what is data? the information we gather using experiments and surveys.

what is statistics? the science of learning from data.

what is design? design refers to planning how to obtain data on a problem of interest.

what is descriptive statistics? summarizing and analyzing the data that have been obtained.

what is inferential statistics? making decisions and predictions based on the data, in order to answer statistical questions.

what is probability? a framework for quantifying how likely various outcomes are.

the population is the set of all subjects of interest. the sample is the subset of those subjects for which data are gathered.

what are inferential statistics? statistics obtained from methods of making decisions or predictions about a population, based on data obtained from a sample of that population.

what is a (sample) statistic? a numerical summary of a sample taken from the population.

what is a parameter? a numerical summary of the population.

what is random sampling? taking a sample from the population in such a way that each subject in the population has the same chance of being picked.
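
The difference between a parameter and a statistic, and the idea of random sampling, can be illustrated with a small sketch. The population of subject IDs below is made up for illustration; `random.sample` draws without replacement, so every subject has the same chance of ending up in the sample.

```python
import random

# Hypothetical population: subject IDs 1..1000 (made up for illustration).
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Simple random sample of 50 subjects, drawn without replacement.
sample = rng.sample(population, k=50)

# A parameter summarizes the whole population ...
parameter = sum(population) / len(population)
# ... while a statistic summarizes only the sample.
statistic = sum(sample) / len(sample)

print("population mean (parameter):", parameter)
print("sample mean (statistic):   ", statistic)
```

The sample mean will usually be close to, but not exactly equal to, the population mean; inferential statistics is about quantifying that gap.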

2 Exploring data with graphs and numerical statistics

what is a variable? any characteristic observed in a study.

what are observations? the data values that we observe for a variable.

when is a variable categorical and when is it quantitative? it's categorical if each observation belongs to one of a set of categories, and it's quantitative if the observations take numerical values.

what is the key characteristic of a quantitative variable? the values must represent different magnitudes, so that it makes sense to take an average of the variable.

when is a quantitative variable discrete, and when is it continuous? it is discrete when its possible values form a set of separate numbers; it is continuous when its possible values form an interval.

what is the modal category? the category with the highest frequency.

what is the mode? the numerical value of a quantitative variable that occurs most often.

what is the proportion? the frequency of a category divided by the total number of observations.

proportions and percentages are relative frequencies.

what is a pareto chart? a bar graph in which the categories are ordered by their frequency, from the highest to the lowest.

what is the pareto principle? a small subset of categories often contains most of the observations.
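
As a minimal sketch of how frequencies, proportions, percentages and the modal category relate, the following uses a made-up list of observations of a categorical variable (the travel modes are purely illustrative).

```python
from collections import Counter

# Made-up observations of a categorical variable (mode of travel).
observations = ["bus", "car", "bike", "car", "car", "bus", "walk", "car"]

freq = Counter(observations)      # frequency per category
n = len(observations)             # total number of observations

# Proportion = frequency of a category / total number of observations.
proportions = {cat: count / n for cat, count in freq.items()}
# Percentage = proportion * 100.
percentages = {cat: 100 * p for cat, p in proportions.items()}

# The modal category is the one with the highest frequency.
modal_category = freq.most_common(1)[0][0]

for cat in freq:
    print(cat, freq[cat], proportions[cat], percentages[cat])
print("modal category:", modal_category)
```

The proportions always add up to 1 and the percentages to 100.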
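
A Pareto chart can be sketched in plain text by sorting the categories by frequency and tracking the cumulative percentage. The complaint categories and counts below are invented for illustration.

```python
from collections import Counter

# Made-up complaint data: one entry per observation.
complaints = (["late delivery"] * 12 + ["damaged item"] * 5
              + ["wrong item"] * 2 + ["other"] * 1)

freq = Counter(complaints)
total = sum(freq.values())

# Pareto ordering: highest frequency first.
ordered = sorted(freq.items(), key=lambda item: item[1], reverse=True)

rows = []
cumulative = 0
for category, count in ordered:
    cumulative += count
    cumulative_pct = 100 * cumulative / total
    rows.append((category, count, cumulative_pct))
    # One '#' per observation gives a simple text bar.
    print(f"{category:<14} {'#' * count:<12} {cumulative_pct:5.1f}%")
```

In this invented data set the first two categories already cover 85% of the observations, which is the pattern the Pareto principle describes.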

what is this and what does it show? a dot plot. a dot plot shows one dot per observation, so the stack of dots above a value shows how often that value occurs.

what is this and what does it show? this is a stem-and-leaf plot. it shows the frequencies of the observations. 16 seconds was the most frequent observation.

what is a histogram? a graph that uses bars to show the (relative) frequencies of the observations of a quantitative variable.

if the set of data is small which type of graph is usually preferred? the stem-and-leaf plot or the dot plot is usually preferred.

what is the distribution of data, or data distribution? the values the variable takes and the frequency of each value. a data distribution is often displayed as a histogram.

when is a distribution called unimodal, and when is it called bimodal? when a distribution has two distinct mounds (like a valley-shaped parabola, but as a histogram) it is called bimodal. when it has one mound it is called unimodal (like a hill-shaped parabola, but as a histogram).
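
A stem-and-leaf plot is easy to build by hand or in code: the stem is every digit except the last, the leaf is the last digit. The sketch below is one possible way to do it for non-negative whole numbers; the times are made up and are not the data from the original figure.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)   # stem = tens, leaf = final digit
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

times = [9, 12, 16, 16, 16, 18, 21, 23, 34]  # made-up times in seconds
for line in stem_and_leaf(times):
    print(line)
```

Reading the row for stem 1, the leaf 6 appears three times, so 16 is the most frequent observation in this made-up sample.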

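a symmetric distribution is often unimodal (one mound in the middle), but it does not have to be: a bimodal distribution can also be symmetric.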

is this distribution skewed to the left or to the right? this distribution is skewed to the left.

what are the tails of a distribution? the parts of the distribution that contain the lowest and the highest values.

what is a time series? a data set collected over time.

what is a time plot? a graph of a time series.

what is a trend? a trend is a pattern in a time plot, such as the values increasing or decreasing over time.

what is the mean? the centre of a distribution, found by taking the average of the observations: their sum divided by the number of observations.

what is the median? the centre of a distribution, found by ordering the observations from small to large and then picking the middle value (or the average of the two middle values when the number of observations is even).

what are the properties of the mean?
 the mean is the balance point of the data: if the observations are placed on a line ordered from small to large, the line would balance at the mean.
 the mean can be highly influenced by an outlier.
 the mean is pulled toward the longer tail in a skewed distribution.

what is an outlier? an observation that falls well above or well below the bulk of the data.

an extremely large value out in the right-hand tail will pull the mean to the right.

 for a symmetric distribution, mean = median
 for a distribution skewed to the right, usually mean > median
 for a distribution skewed to the left, usually mean < median

what does the median being resistant to extreme observations mean? that the median changes little or not at all when extreme values are added or changed.
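
The effect of an outlier on the mean and the median can be checked with a few made-up numbers. This is only a sketch using Python's standard `statistics` module; the values are invented.

```python
from statistics import mean, median

values = [20, 22, 25, 27, 30]     # made-up, roughly symmetric data
with_outlier = values + [250]     # add one extreme value in the right tail

print("without outlier: mean =", mean(values), "median =", median(values))
print("with outlier:    mean =", mean(with_outlier), "median =", median(with_outlier))

# The outlier drags the mean far to the right,
# while the median barely moves: it is resistant.
shift_in_mean = mean(with_outlier) - mean(values)
shift_in_median = median(with_outlier) - median(values)
print("shift in mean:", shift_in_mean, "shift in median:", shift_in_median)
```

With the outlier added the data are skewed to the right, and the mean ends up well above the median, as the rules of thumb above suggest.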

what is the mode? the value that occurs most frequently.

what is the range? the difference between the largest and the smallest observation. range = largest - smallest

what is the deviation of an observation? the difference between the observation and the mean.

what is the formula for the deviation? with x being an observation and $\bar{x}$ the mean: $x - \bar{x}$

the sum of the deviations always equals zero.
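
That the deviations cancel follows directly from the definition of the mean. A short derivation, writing $x_i$ for the observations, $n$ for their number and $\bar{x}$ for the mean (the index $i$ is notation added here for clarity):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

This is why the deviations are squared before averaging: otherwise the positive and negative deviations would always cancel out.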

what is the variance? an average of the squared deviations: the sum of the squared deviations divided by n - 1.

what is the formula for the standard deviation? $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
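
The range, deviations, variance and standard deviation can be worked through step by step on a small made-up data set. The sketch follows the formula with n - 1 in the denominator and compares the result with `statistics.stdev` from the standard library.

```python
from math import sqrt
from statistics import stdev

data = [2, 4, 4, 5, 10]            # made-up observations
n = len(data)

data_range = max(data) - min(data)          # range = largest - smallest
x_bar = sum(data) / n                       # the mean
deviations = [x - x_bar for x in data]      # x - x_bar for each observation

# Variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
# Standard deviation: square root of the variance.
s = sqrt(variance)

print("range:", data_range)
print("deviations:", deviations, "sum:", sum(deviations))
print("variance:", variance)
print("standard deviation:", s, "(statistics.stdev:", stdev(data), ")")
```

The deviations here are -3, -1, -1, 0 and 5, which sum to zero; their squares sum to 36, so the variance is 36 / 4 = 9 and the standard deviation is 3.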

what is described when we talk about the pth percentile? the pth percentile is a value such that p percent of the observations fall at or below that value.
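
A rough way to check a percentile is to count which share of the observations falls at or below a given value. The helper `percent_at_or_below` and the scores are made up for this sketch; `statistics.quantiles` is a standard-library function, but note that software packages use slightly different interpolation rules, so its cut points can differ a little from a textbook hand calculation.

```python
from statistics import quantiles

def percent_at_or_below(data, value):
    """Percentage of observations that are less than or equal to value."""
    return 100 * sum(x <= value for x in data) / len(data)

scores = list(range(1, 101))   # made-up scores: 1, 2, ..., 100

# 90 percent of these scores are at or below 90,
# so 90 acts as the 90th percentile of this data set.
print(percent_at_or_below(scores, 90))

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(scores, n=4)
print(q1, q2, q3)
```
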
Why must we test whether the slope b of the regression line differs from 0?
We ask: does this relationship also exist in the population?
How do you carry out the chi-square test?
Compare the expected and the observed frequencies with each other.
What is the chi-square test for?
The chi-square test is used to test the association between categorical variables.
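
Comparing expected and observed frequencies can be sketched as follows. The 2x2 table is made up, and the sketch assumes the usual chi-square statistic: the sum over all cells of (observed - expected)^2 / expected, where the expected count of a cell is (row total * column total) / n.

```python
# Made-up 2x2 contingency table of observed frequencies
# (rows and columns are the categories of two categorical variables).
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", expected)
print("chi-square:", chi_square, "df:", df)
```

Every expected count here is 25 and every cell is off by 5, so each cell contributes 25 / 25 = 1 and the statistic is 4 with 1 degree of freedom. The larger the statistic, the stronger the evidence of an association.
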
What do you use the F-test for?
For testing several independent variables at once.
The F statistic is a ratio of two mean squares, with degrees of freedom df1 and df2,
where df1 = k = the number of independent variables
and df2 = n - (k + 1).
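
The degrees of freedom are simple to compute. The sketch below also shows one common way to obtain the F statistic from R², namely F = (R² / df1) / ((1 - R²) / df2); that formula is not in these notes and is added here as an assumption, and the values for n, k and R² are made up.

```python
def f_degrees_of_freedom(n, k):
    """Degrees of freedom for the F-test with n observations and k independent variables."""
    df1 = k
    df2 = n - (k + 1)
    return df1, df2

def f_from_r_squared(r_squared, n, k):
    """F statistic from R^2 (assumed formula: explained share per df1 over unexplained share per df2)."""
    df1, df2 = f_degrees_of_freedom(n, k)
    return (r_squared / df1) / ((1 - r_squared) / df2)

# Made-up example: 50 observations, 2 independent variables, R^2 = 0.4.
df1, df2 = f_degrees_of_freedom(n=50, k=2)
f_value = f_from_r_squared(0.4, n=50, k=2)
print("df1 =", df1, "df2 =", df2, "F =", f_value)
```
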
What does multiple regression involve?
Multiple regression looks at several variables that relate to a phenomenon, for example the effect of gender and number of hours worked on income. The effect of each variable is examined while the other added variables are held constant.
Explain what r² = 0.4 means
In theoretical terms: the prediction error when you use the predicted Y (with X in the equation) is 40% smaller than when you use the mean of Y (without X). In practical terms: 40% of the variance in Y is explained by X.
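
This reading of r² as a proportional reduction in error can be written as a formula. Here $\bar{y}$ is the mean of Y and $\hat{y}$ the value predicted from X; the sums run over all observations.

```latex
r^2 = \frac{\sum (y - \bar{y})^2 - \sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
\qquad\Longrightarrow\qquad
r^2 = 0.4 \;\text{ means }\; \sum (y - \hat{y})^2 = 0.6 \sum (y - \bar{y})^2 .
```

In words: the squared error left over after using X is 60% of the squared error you would make by always predicting the mean of Y.
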
What does the least squares method involve?
SPSS chooses the line that makes the (squared) errors as small as possible, which gives a line that lies as close as possible to the observed data.
What does the residual sum of squares involve?
You compute each residual, square it, and add the squares together; squaring makes every term positive.
What is the difference between a positive and a negative residual?
A positive residual occurs when Y is greater than the prediction, a negative residual when Y is smaller than the prediction: $e = y - \hat{y}$
What is a residual (prediction error)?
The difference between the observed and the predicted value.
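
The ideas of residuals, the residual sum of squares and least squares can be tied together in a short sketch. The data are made up, and the slope and intercept use the standard least squares formulas for a single predictor (slope = sum of cross-deviations / sum of squared x-deviations); SPSS is not involved, this is only an illustration of the principle.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least squares line y = a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def residuals(xs, ys, a, b):
    """Residual e = observed y minus predicted y, for each observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def rss(res):
    """Residual sum of squares: square each residual and add them up."""
    return sum(e ** 2 for e in res)

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
res = residuals(xs, ys, a, b)
print("intercept:", a, "slope:", b)
print("residuals:", res)          # positive: y above the line, negative: below
print("RSS of least squares line:", rss(res))

# A different line (hypothetical intercept 2.0 and slope 0.7) fits worse.
other = residuals(xs, ys, 2.0, 0.7)
print("RSS of another line:      ", rss(other))
```

The least squares line is the one with the smallest residual sum of squares, so any other choice of intercept and slope gives a larger value.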