R Data Frame Basics

Jakob Jenkov
Last update: 2015-11-25

An R data frame is a tabular data structure very similar to an Excel spreadsheet, or a table in a relational database. Actually, the name "Data Frame" is a bit confusing if you ask me. I would have preferred calling it a "Data Table". This R data frame tutorial explains the basics of working with R data frames, such as getting the number of rows and columns, referencing individual columns, and applying simple functions to them.

Loading Data Frames

As you may already have seen in the tutorial about loading data in R, R makes it easy to load data into memory so you can work with it. In fact, when you load a CSV file into memory, the data in the CSV file is loaded into an R data frame.

For the remaining part of this R data frame tutorial I will assume that you have loaded a data frame into memory using the following command (with the file path adjusted to point to where the file is stored on your computer):

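# stringsAsFactors=TRUE stores text columns as factors (the default before R 4.0)
data = read.table(file="data.csv", header=TRUE, sep=";", stringsAsFactors=TRUE)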

This command loads the CSV file into an R data frame, and assigns the data frame to the variable named data. Using the data variable you can now access the loaded data frame.

The file should contain the following data:

name;age;salary
John;35;99000
Joe;42;120000
Cindy;55;150000
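
If you would rather not create a file while experimenting, the same rows can be read from a string instead. This is only a minimal sketch: the variable name csv_text is made up for this example, and the text argument of read.table() is used in place of a file path.

```r
# The CSV content shown above as one string (csv_text is a name chosen for this example).
csv_text = "name;age;salary
John;35;99000
Joe;42;120000
Cindy;55;150000"

# text= makes read.table() parse the string instead of opening a file.
# stringsAsFactors=TRUE stores the name column as a factor (the default before R 4.0).
data = read.table(text=csv_text, header=TRUE, sep=";", stringsAsFactors=TRUE)

print(data)
```

The resulting data variable can be used with the rest of the commands in this tutorial in the same way as a data frame loaded from a file.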

summary()

The summary() function prints a summary of a given R data frame. You refer to the data frame using its variable name. If you want to print a summary of the data frame loaded using the R command shown earlier in this tutorial, type this into the R console in RStudio:

summary(data)

When executed inside RStudio on the data frame with the data shown earlier, you will get the following output:

    name        age           salary
 Cindy:1   Min.   :35.0   Min.   : 99000
 Joe  :1   1st Qu.:38.5   1st Qu.:109500
 John :1   Median :42.0   Median :120000
           Mean   :44.0   Mean   :123000
           3rd Qu.:48.5   3rd Qu.:135000
           Max.   :55.0   Max.   :150000

As you can see, the summary() function gives you a variety of different information about the data frame. I will briefly explain what that information is.

The left-most column printed corresponds to the first column of the data frame, the column called name. It shows each distinct value of the name column, along with a count of how many times that value occurs. Since the name column from the data example shown earlier contains only 3 names, all different from each other, the count after each name is 1. If a name had occurred more than once in the name column, the count would show how many times it occurred.
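
To see a count higher than 1, here is a small sketch with a repeated name. The people data frame and its extra "John" row are invented for this example and are not part of data.csv.

```r
# A small data frame built in code; the second "John" row is invented for this example.
people = data.frame(
  name = c("John", "Joe", "Cindy", "John"),
  age  = c(35, 42, 55, 28),
  stringsAsFactors = TRUE  # keep name as a factor so summary() counts its values
)

# summary() on a factor column returns one count per distinct value.
name_counts = summary(people$name)
print(name_counts)
```

Here the count for John is 2, while Joe and Cindy each still have a count of 1.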

The second column printed shows a summary of the data from the column called age. The printed column shows the minimum, mean and maximum values found in the age column, along with the median, 1st and 3rd quartile values (statistical values).

The second printed column contains different information than the first because the underlying data frame columns contain different types of data. The name column contains textual data. The age column contains numerical data. It doesn't make sense to calculate the mean value of names.

The third printed column contains the same type of summary data as the second printed column. This is because the third column in the data frame, salary, contains numerical data too.
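
If you are unsure which columns R treats as numerical, you can ask it directly. The following is a minimal sketch that rebuilds the sample data in code (rather than loading data.csv) and applies is.numeric() to every column with sapply(); the variable name numeric_columns is chosen for this example.

```r
# The sample data from this tutorial, rebuilt in code instead of loaded from a file.
data = data.frame(
  name   = c("John", "Joe", "Cindy"),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000),
  stringsAsFactors = TRUE
)

# sapply() calls is.numeric() once per column and collects the results
# in a logical vector named after the columns.
numeric_columns = sapply(data, is.numeric)
print(numeric_columns)
```

The name column comes out as FALSE, while age and salary come out as TRUE, which matches the two kinds of summary printed by summary().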

dim()

The dim() function returns the dimensions of a data structure in R. When called with an R data frame as parameter, dim() returns the number of rows and columns (in that order) of the data frame, and the R console prints them.

For instance, if you have the data set shown earlier loaded into memory and assigned to a variable named data, you could obtain the dimensions of the loaded data frame with this command:

dim(data)

The output from this command would be:

[1] 3 3

The first 3 is the number of rows in the data frame, and the last 3 is the number of columns. The data set shown earlier in this R data frame tutorial contains 3 rows and 3 columns, remember?
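
The two numbers are also available separately through the nrow() and ncol() functions. Here is a minimal sketch, assuming the sample data is rebuilt in code rather than loaded from data.csv; the variable names dims, row_count and col_count are chosen for this example.

```r
# The sample data from this tutorial, rebuilt in code instead of loaded from a file.
data = data.frame(
  name   = c("John", "Joe", "Cindy"),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000)
)

dims      = dim(data)   # rows first, then columns
row_count = nrow(data)  # same value as dims[1]
col_count = ncol(data)  # same value as dims[2]

print(dims)
print(row_count)
print(col_count)
```

Using nrow() or ncol() can be easier to read than indexing into the result of dim() when you only need one of the two values.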

Referencing Data Frame Columns

You can reference a column of an R data frame via the column name. If the data was loaded from a CSV file, the column name is the name given to that column in the first line (the header line) of the CSV file. If you look at the data set shown earlier in this tutorial, you will see that the CSV file contains three columns. The first line of the data set specifies the column names, which are name, age and salary.

To reference one of the name, age or salary columns you write the name of the variable pointing to the data frame, a dollar sign ($) and then the name of the column to reference. Here is how referencing the age column looks:

data$age

If you type the above R command into the R console in RStudio and press enter, R will print out the values of the age column to the console. Here is how the output looks:

[1] 35 42 55

The three values 35, 42 and 55 are the values listed in the age column of the data frame.

Simply printing out the values of a column is not always that useful. However, you can pass a data frame column to some R functions and thus have the given R function perform its calculation on the values in that column. For instance, here is how you can calculate the mean age of the data set shown earlier:

mean(data$age)

The output from this R command would be:

[1] 44

The mean age of the persons listed in the data set is 44.
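
Other functions that accept a vector of numbers can be applied to a column in the same way. The sketch below rebuilds the sample data in code (rather than loading data.csv) and uses max(), min() and mean(); the variable names are chosen for this example. It also shows the double bracket form data[["age"]], which refers to the same column as data$age.

```r
# The sample data from this tutorial, rebuilt in code instead of loaded from a file.
data = data.frame(
  name   = c("John", "Joe", "Cindy"),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000)
)

oldest      = max(data$age)      # largest value in the age column
youngest    = min(data$age)      # smallest value in the age column
mean_salary = mean(data$salary)  # average of the salary column

# data[["age"]] is another way to reference the same column as data$age.
same_column = identical(data[["age"]], data$age)

print(oldest)
print(youngest)
print(mean_salary)
print(same_column)
```

The double bracket form is handy when the column name is stored in a variable, since the dollar sign form needs the name written out literally.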

Copyright  Jenkov Aps