# Data Concepts

Jakob Jenkov

Mathematical analysis is carried out on data. In this tutorial I will explain some central data concepts related to mathematical analysis including statistics and probability.

## Population

The term *population* refers to a data set that contains information about the whole entity being analyzed.

Example 1: If a company performs an employee satisfaction survey and all employees participate in the survey, the resulting data set of that survey is referred to as a population.

Example 2: If a country performs a political opinion poll and every single citizen in the country participates, then the resulting data set of that poll is referred to as a population.

## Sample

The term *sample* refers to a data set that contains information about only a subset of the whole entity being analyzed.
In other words, a sample is a subset of the total population.

Example 1: If a company performs an employee satisfaction survey and only 10% (or any subset) of the employees participate in the survey, then the resulting data set is referred to as a sample.

Example 2: If a country performs a political opinion poll and only 5% (or any subset) of the citizens of that country participate, then the resulting data set is referred to as a sample.

### Biased Sample

If a sample does not correctly represent the full population it is part of, that sample is called a biased sample. The statistics you calculate from that sample will then be skewed (biased) and will not describe the population accurately.

Example: If a company performs an employee satisfaction survey but only asks employees in upper management, then that sample of the employees is not representative of all employees in the company. The survey would not include any information about the satisfaction of non-management employees.
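To make the effect of a biased sample concrete, here is a minimal Python sketch. The satisfaction scores and the split between management and other employees are invented for illustration; they do not come from any real survey.

```python
from statistics import mean

# Hypothetical satisfaction scores on a 1-10 scale (invented numbers).
management_scores = [9, 8, 9, 10]
staff_scores = [5, 6, 4, 7, 5, 6, 5, 6]

# The full population: every employee in the company.
population = management_scores + staff_scores

# A biased sample: only upper management was asked.
biased_sample = management_scores

population_mean = mean(population)
biased_mean = mean(biased_sample)

print(f"Mean satisfaction, all employees:   {population_mean:.2f}")
print(f"Mean satisfaction, management only: {biased_mean:.2f}")
```

With these made-up numbers the management-only mean is clearly higher than the mean for all employees, which is the kind of distortion a biased sample can introduce.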

## Parameter

In the context of statistics the term *parameter* describes a certain aspect or attribute of a population
("population" as defined above). For instance, the average degree of education of all people in a country is a parameter
of that country's population.

A parameter is similar to a statistic (see the following section) except a parameter describes a full population whereas a statistic describes a sample.

## Statistic

A *statistic* describes a certain aspect or attribute of a sample. For instance, the average degree of education
of people, calculated from a sample rather than from the full population, is a statistic.

As mentioned above, a statistic is similar to a parameter except a parameter describes an aspect or attribute of a full population whereas a statistic describes an aspect or attribute of a sample.
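The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is randomly generated, invented data (years of education for 1,000 people), and the sample size of 50 is an arbitrary choice for illustration.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: years of education for 1,000 people (invented data).
population = [random.randint(9, 20) for _ in range(1000)]

# Parameter: calculated from the whole population.
population_mean = mean(population)

# Statistic: calculated from a random sample of 50 people.
sample = random.sample(population, 50)
sample_mean = mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The two numbers will normally be close but not identical, since the statistic is only an estimate of the parameter.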

## Primary Data

The term *primary data* means data that you have collected yourself, and that you intend to use for
mathematical analysis or data science.

## Secondary Data

The term *secondary data* means data that somebody else has collected, and that you intend to scrutinize
via mathematical analysis or data science.

## Quantitative Data

*Quantitative data* is data which is numerical, meaning you can perform calculations on it. For instance, the
age of the employees, or the number of years they have been employed, in an employee satisfaction survey is quantitative data.

## Qualitative Data

*Qualitative data* is data which is descriptive rather than numerical. For instance, in an employee satisfaction
survey you may ask employees to describe in words how they feel about working for the employer. This data is
not quantitative as it is not numerical. It is descriptive and therefore qualitative.

## Nominal Data

*Nominal data* is a subcategory of qualitative data. Nominal data is data which falls into one of a set
of predefined categories that have no natural order. Examples of nominal data are the gender, hair color and eye color of a person.
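Since nominal categories are just labels, the meaningful operations are counting how often each category occurs and finding the most common one. A minimal Python sketch, using an invented list of eye colors:

```python
from collections import Counter

# Hypothetical eye colors recorded in a survey (invented data).
eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]

# Counting how often each category occurs is meaningful for nominal data.
counts = Counter(eye_colors)

# So is finding the most common category (the mode).
most_common_color, occurrences = counts.most_common(1)[0]

print(counts)
print(f"Most common: {most_common_color} ({occurrences} times)")
```

Calculating an "average eye color", on the other hand, would not make sense.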

## Ordinal Data

*Ordinal data* is data which, like nominal data, falls into categories, but where the categories can be ordered. For instance, cars being tested
for safety might be given a number between 1 and 10 describing how safe they are. This enables us to order
the cars by their safety level, but we cannot say that a car with safety level 5 is only half as safe
as a car with safety level 10.

The ordinal categorization (1 to 10) in this example cannot be translated to risk of injury directly. You might only have a 30% bigger chance of getting hurt in a car with a safety level of 1 than in a car with safety level 10. That depends on how the scale from 1 to 10 was designed.

It also doesn't make sense to add, subtract, multiply or divide ordinal data. It does not make much sense to say that car A is 2 points better than car B on the safety level. It depends on what the 2 points of difference represent (e.g. which safety features the car is missing), and that might be different from car to car.
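What you can do with ordinal data is use the order itself. The following Python sketch ranks four cars by a 1 to 10 rating and finds the middle rating. The car names and ratings are invented for illustration.

```python
from statistics import median_low

# Hypothetical ratings on a 1-10 scale for four cars (invented data).
ratings = {"car A": 7, "car B": 5, "car C": 10, "car D": 3}

# Ordering is meaningful for ordinal data: rank the cars from best to worst.
ranked = sorted(ratings, key=ratings.get, reverse=True)

# The median is also meaningful, because it only relies on order.
# median_low returns an actual rating from the data, not an average of two ratings.
middle_rating = median_low(ratings.values())

print(ranked)
print(middle_rating)
```

I use `median_low` here rather than `median` because `median` would average the two middle ratings when the number of cars is even, and averaging is exactly the kind of arithmetic that ordinal data does not support.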

## Interval Data

*Interval data* is data measured on a scale where the difference (interval) between two values is meaningful. For instance, profit or temperature.
Interval data is a subset of quantitative data. You can only meaningfully use the mathematical operations of addition
and subtraction on interval level data, not multiplication or division.

For instance, company A might have twice as big a profit as company B, but you cannot conclude from this that company A is twice as profitable in general. That depends on other factors like revenue, the amount of dollars invested in the company etc.

Another example is temperatures in Fahrenheit. You can say that a temperature of 20 degrees Fahrenheit is 10 degrees colder than a temperature of 30 degrees Fahrenheit, but you cannot say that 30 degrees Fahrenheit is 50% warmer than 20 degrees Fahrenheit. The Fahrenheit scale cannot be translated like that, because 0 degrees Fahrenheit does not mean "no temperature". The zero point of the scale is arbitrary.
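A quick way to see why ratios do not work on interval data is to express the same two temperatures in another unit. In the Python sketch below, the helper function `fahrenheit_to_celsius` is my own name for the standard conversion formula.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

cold_f, warm_f = 20.0, 30.0
cold_c = fahrenheit_to_celsius(cold_f)
warm_c = fahrenheit_to_celsius(warm_f)

# Differences are meaningful: a step of 10 degrees Fahrenheit
# always corresponds to the same step of 10 * 5/9 degrees Celsius.
diff_f = warm_f - cold_f
diff_c = warm_c - cold_c

# Ratios are not: the "ratio" changes when the same two
# temperatures are expressed in a different unit.
ratio_f = warm_f / cold_f
ratio_c = warm_c / cold_c

print(f"Difference: {diff_f:.1f} F = {diff_c:.2f} C")
print(f"Ratio in Fahrenheit: {ratio_f:.2f}")
print(f"Ratio in Celsius:    {ratio_c:.2f}")
```

The two temperatures are physically the same in both units, yet the ratio comes out completely different, so the ratio cannot be telling us how much "warmer" one is than the other.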

## Ratio Data

*Ratio data*, or ratio level data, is data where you can both order the observations and perform all
operations like addition, subtraction, multiplication and division on them.

An example would be employee salary. You can order employees by salary. You can say that employee A earns X more (or less) than employee B simply by subtracting one salary from the other. And you can say that employee A's salary is N percent of employee B's salary by dividing A's salary by B's salary (and multiplying by 100 after).

Another example would be how fast a car can drive. You can say that car A can drive 10 km/h faster than car B, and you can say that car A can drive 2 times as fast as car C.
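The salary example can be sketched in Python like this. The two salaries are invented numbers, and the variable names are just for illustration.

```python
# Hypothetical monthly salaries (invented numbers).
salary_a = 6000.0
salary_b = 5000.0

# Subtraction: how much more employee A earns than employee B.
difference = salary_a - salary_b

# Division: employee A's salary as a percentage of employee B's salary.
percent_of_b = salary_a / salary_b * 100

# Percent more: the difference relative to employee B's salary.
percent_more = difference / salary_b * 100

print(f"A earns {difference:.0f} more than B")
print(f"A's salary is {percent_of_b:.0f}% of B's salary")
print(f"A earns {percent_more:.0f}% more than B")
```

Note that dividing the two salaries gives A's salary as a percentage of B's salary. To get how many percent more A earns, the difference is divided by B's salary instead.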
