



R - Load Data

Jakob Jenkov
Last update: 2015-11-20

R is a programming language designed for data analysis. Therefore loading data is one of the core features of R.

R contains a set of functions that can be used to load data sets into memory. You can also load data into memory using R Studio - via the menu items and toolbars. In this tutorial I will cover both methods.

Which method of loading data in R you should use depends on what you are doing. If you are just playing around with some data, using the R Studio menu items might be fine. But if you are writing an R program that needs to be repeated for many different data sets, it is usually better to write the loading of data as R program statements.

Data Formats

This tutorial covers loading data from two kinds of files:

  • CSV files
  • Text files

CSV means Comma Separated Values. You can export CSV files from many applications that hold data. For instance, you can export CSV files from data in an Excel spreadsheet. Here is an example of what a CSV file looks like inside:

name,id,salary
"John Doe",1,99999.00
"Joe Blocks",2,120000.00
"Cindy Loo",3,150000.00

As you can see, the values on each line are separated by commas. The first line contains a list of column names. These column names tell what the data in the following lines mean. The names are meaningful only to you. R does not interpret them; it just uses these names to identify the data in the different columns.
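As a small illustration of how R uses the column names, here is a sketch that loads the CSV content shown above and looks up a column by name. It uses the read.csv() function, which is covered later in this tutorial. The variable names csv_text and employees are made up for this example, and the data is passed inline via the text parameter instead of being read from a file.

```r
# The sample CSV content from above, supplied inline via the text parameter
csv_text <- 'name,id,salary
"John Doe",1,99999.00
"Joe Blocks",2,120000.00
"Cindy Loo",3,150000.00'

employees <- read.csv(text = csv_text)

# The header line became the column names of the data set
print(names(employees))

# A column is selected by its name
print(employees$salary)
```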

A text file is typically similar to a CSV file, but instead of using commas as separators between values, text files often use another character, such as a tab character. Here is an example of how a text file could look inside:

name            id      salary
"John Doe"      1       99999.00
"Joe Blocks"    2       120000.00
"Cindy Loo"     3       150000.00

As you can see, the data might be easier to read in text format, at least when you look at the data directly in the data file. Once the data is loaded into R or R Studio, there is no difference. You can look at the data in R Studio's tabular data set viewer, and there you cannot see the difference between CSV files and text files.

Actually, the name "text files" is a bit confusing. Both CSV files and text files contain data in textual form (as characters). One uses commas as separators between the values, whereas the other uses tab characters.
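To illustrate that the separator is the only real difference, here is a sketch that loads the same rows once with commas and once with tabs and compares the results. It uses the read.table() function, which is explained later in this tutorial. The variable names are made up for this example, and the tab-separated version is generated from the CSV version to keep the two in sync.

```r
# The same rows, once comma separated and once tab separated
csv_text <- paste(
  'name,id,salary',
  '"John Doe",1,99999.00',
  '"Joe Blocks",2,120000.00',
  '"Cindy Loo",3,150000.00',
  sep = "\n"
)
tab_text <- gsub(",", "\t", csv_text, fixed = TRUE)

from_csv <- read.table(text = csv_text, header = TRUE, sep = ",")
from_tab <- read.table(text = tab_text, header = TRUE, sep = "\t")

# Once loaded, the two data sets are the same
print(all.equal(from_csv, from_tab))
```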

Load Data Via R Studio Menu Items

The easiest way to load data into memory in R is by using the R Studio menu items. R Studio has menu items for loading data in two different places. The first is in the toolbar of the upper right section of R Studio. This screenshot shows where the "Import Dataset" button is (look for the little mouse pointer "hand"):

Import Dataset button in R Studio upper right section

When you click the button you get this little menu:

Import Dataset button clicked - in R Studio upper right section

You can also import data from the top menu of R Studio. The next screenshot shows where the "Import Dataset" menu item is located in R Studio's top menu:

Import Dataset menu item - in R Studio's top menu

Text File or Web URL

As you can see in both the "Import Dataset" menu items, you can import a data set "From Text File" or "From Web URL". These two options refer to where you load the data from. "From Text File" means from a text file on your local computer. "From Web URL" means that you load the data from a web server somewhere on the internet.

Regardless of whether you choose "From Text File" or "From Web URL", R can load the file as either a CSV or text file. The location of the file has nothing to do with the data format used inside the file. Don't get confused by that. The menu item "From Text File" does not mean "text file format" (tab characters as separators). It just means "a file on your local computer". "From Local File" would probably have been a more informative text for this menu item.

Selecting Data Format

After you have chosen the location to load the file from, you will be shown a dialog like this:

Specify data format - dialog in R Studio

The select boxes (drop down boxes) on the left allow you to specify the data format of the file you are about to import. On the right you can see two boxes. The top box shows you what the data file looks like. The bottom box shows you how R Studio interprets the data in the file based on the choices made in the select boxes on the left side of the dialog. If you change the choices in the select boxes you will see that the bottom right box changes.

When you have selected all the configurations you need in the select boxes on the left, click the "Import" button. The data will now be loaded into R Studio.

Note that R Studio prints the R commands it used to load the data to the R console in the left side of R Studio. You can copy these commands and use them to load data into R via R code.

After the Data is Loaded

After you have loaded the data into R Studio it will look similar to the screenshot below:

Screenshot of R Studio after data has been imported.

Notice that in the top right part of R studio a new data variable has turned up. This variable references the loaded data. Via this variable you can access the data set and execute various R functions on the data set, like calculating the mean value of a certain column etc.
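As a sketch of what working with such a variable can look like, the example below calculates the mean of a column. It assumes the data set contains the sample rows used earlier in this tutorial and is referenced by a variable named data. The rows are supplied inline here so the example can run on its own.

```r
# Sample rows supplied inline; in R Studio the variable would come from the import
data <- read.csv(text = "name,id,salary
John Doe,1,99999
Joe Blocks,2,120000
Cindy Loo,3,150000")

# Mean of the salary column: (99999 + 120000 + 150000) / 3
print(mean(data$salary))   # prints 123333

# Number of rows and columns in the data set
print(dim(data))
```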

Notice also that you can see the loaded data in the upper left section of R Studio.

If you look at the lower left part of R Studio, the console area, you can see that the command used to import the data was printed out to the console. You can use this command to load data via R, as an R command, instead of via the R Studio GUI.

Loading Data in R

As mentioned earlier you can also load data in the R programming language. In fact, R Studio translates its wizard into R function calls when importing data.

This tutorial covers three R functions which can import data. These are:

  • read.table()
  • read.csv()
  • read.delim()

These functions are very similar to each other, so if you master one of them you will soon master the others. In fact, you can probably just use the read.table() function for all of your data imports. These 3 functions will be covered in the following sections.

read.table()

The R function read.table() loads data from a file into a tabular data set (called a data frame in R) in memory. A tabular data set consists of rows and columns, just like a spreadsheet. Sometimes rows are also referred to as "records" and columns referred to as "fields" or "properties".

The read.table() function can take many parameters. The examples in this tutorial use three of them:

  • The file name of the file to load
  • A flag telling if the file contains a header line
  • The separator character used inside the file to separate the values of each row.

The parameters to read.table() are listed between the parentheses, separated with commas. Here is an example of loading a CSV file using read.table() in R:

read.table("data.csv", header=T, sep=";")

The first parameter is the path to the file to read. In the example above that is the "data.csv" part. In the above example only the file name itself is given. In that case R expects to find the file in its current working directory. If you want to specify the full path to the file, you can do so too. Here is an example of how that looks on Windows:

"d:\\data\\projects\\tutorial-projects\\r-programming\\data.csv"

Normally, Windows only uses a single backslash (the \ character) between directory names, but in programming languages it is normal to use the \ character as an escape character in strings (text variables). When a programming language sees a \ in a string it will normally look at the next character after the \ to determine what character to insert into the string. To actually insert a \ you will therefore often need two \ (\\) as shown above.

The same file path on a Mac or Linux machine could look like this:

"/data/projects/tutorial-projects/r-programming/data.csv"

Notice the use of / between directories instead of \, and notice that you only need a single / between the directories, because / is not an escape character.
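The escaping rule can be checked directly in R. The sketch below shows that a doubled backslash in R source code is a single character in the actual string, and that the built-in file.path() function joins directory names with forward slashes, which R also accepts in file paths on Windows. The directory names are taken from the examples above; the variable names are made up for this example.

```r
windows_path <- "d:\\data\\projects\\tutorial-projects\\r-programming\\data.csv"

# cat() writes the string as R stores it, with single backslashes
cat(windows_path, "\n")

# "d:\\data" holds 7 characters: d : \ d a t a
print(nchar("d:\\data"))

# file.path() joins path parts using forward slashes
portable_path <- file.path("data", "projects", "data.csv")
print(portable_path)
```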

The second parameter of read.table() is the header=T part. This tells the read.table() function whether the first line in the data file is a header line or not. A value of header=T or header=TRUE means that the first line is a header line. A value of header=F or header=FALSE means that the first line is not a header line.

A "header line" is a first line that contains the column names rather than data. Look at this CSV file:

name;id;salary
John Doe;1;99999
Joe Blocks;2;120000
Cindy Loo;3;150000

Notice how the first row contains the column names for the data on the following rows.
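To see what the header flag changes, here is a sketch that loads the file contents above twice, once with header=TRUE and once with header=FALSE. The contents are supplied inline via the text parameter, and the variable names are made up for this example.

```r
csv_text <- "name;id;salary
John Doe;1;99999
Joe Blocks;2;120000
Cindy Loo;3;150000"

with_header    <- read.table(text = csv_text, header = TRUE,  sep = ";")
without_header <- read.table(text = csv_text, header = FALSE, sep = ";")

# With header=TRUE the first line supplies the column names
print(names(with_header))
print(nrow(with_header))      # 3 data rows

# With header=FALSE R generates column names and treats the first line as data
print(names(without_header))
print(nrow(without_header))   # 4 rows
```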

The third parameter specifies which character is used inside the data file to separate the column values on each row. If you look at the CSV file contents above you can see that a semicolon (;) is used as separator. That is why the third parameter to the read.table() function call is sep=";", meaning that the separator character used in the data file is a semicolon.

To execute read.table() you type the commands shown in this section into the console part of R Studio and press the "Enter" key.
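If you do not have a data.csv file at hand, the following sketch creates one and then loads it with the three parameters described above. Writing the file to a temporary directory via tempdir() is an assumption made only to keep the example self-contained; the variable names are made up as well.

```r
# Create a small semicolon separated file in a temporary directory
csv_path <- file.path(tempdir(), "data.csv")
writeLines(
  c("name;id;salary",
    "John Doe;1;99999",
    "Joe Blocks;2;120000",
    "Cindy Loo;3;150000"),
  csv_path
)

# A bare file name such as "data.csv" would be looked up in this directory
print(getwd())

# Load the file: path, header flag, separator
employees <- read.table(csv_path, header = TRUE, sep = ";")
print(employees)

# str() shows the type R chose for each column
str(employees)
```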

More read.table() Parameters

The read.table() function is very advanced and can take more parameters than I have shown above. To get a full list of the parameters, type

help("read.table")

into the R console in R Studio and press enter. In the lower right part of the R Studio window, R Studio will show you the help for the read.table() function.
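You can also inspect the parameter list from code. The sketch below uses the built-in formals() function to list the parameter names of read.table() and to look up two of the default values. The variable name param_names is made up for this example.

```r
# Names of all parameters read.table() accepts
param_names <- names(formals(read.table))
print(param_names)

# Default values used when a parameter is left out
print(formals(read.table)$header)   # FALSE: no header line is assumed
print(formals(read.table)$sep)      # "": values are split on whitespace
```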

Assigning the Data Set to a Variable

If you just type in this command:

read.table("data.csv", header=T, sep=";")

Then R Studio will load the data file and print its contents to the console. But the data set will not be kept in memory. To keep the data set in memory so you can work with it, you have to assign it to a variable. You do so like this:

data = read.table("data.csv", header=T, sep=";")

Or like this:

data <- read.table("data.csv", header=T, sep=";")

The first word, data, is the name of the variable you want to assign the loaded data set to. You can freely choose the variable name (but not all characters are allowed). You can load multiple data sets into memory, and assign each data set to its own variable. Then you can access them separately during your analysis.

Whether you use = or <- after the variable name doesn't matter for a simple assignment like this one. I prefer the = character because I am used to that from other programming languages. But some prefer the <- notation. The result is the same though.
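Here is a sketch of two data sets kept in separate variables, one assigned with = and one with <-. The offices data set and both variable names are hypothetical and exist only for this illustration. The data is supplied inline via the text parameter.

```r
# First data set, assigned with =
employees = read.table(text = "name;id;salary
John Doe;1;99999
Joe Blocks;2;120000
Cindy Loo;3;150000", header = TRUE, sep = ";")

# Second (hypothetical) data set, assigned with <-
offices <- read.table(text = "city;floor
Copenhagen;3
Aarhus;1", header = TRUE, sep = ";")

# Each variable references its own data set
print(nrow(employees))
print(nrow(offices))
```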

read.csv()

The read.csv() function reads a CSV file into memory. It accepts the same parameters as the read.table() function, but with defaults suited to CSV files: header=TRUE and sep=",". Here is an example call to the read.csv() function:

data = read.csv("D:\\data\\data.csv", header=T, sep=";")

This example loads the CSV file located at D:\data\data.csv and assigns it to the variable named data. The first line is a header line containing the names of the columns in the CSV file. This is specified by the second parameter header=T. The third parameter specifies that the separator character used inside the CSV file is ; (a semicolon).
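Because read.csv() defaults to header=TRUE and sep=",", a comma separated file with a header line can be loaded without passing either parameter. The sketch below compares the variants on inline sample data. The variable names are made up for this example.

```r
semi_text  <- "name;id;salary
John Doe;1;99999
Joe Blocks;2;120000"
comma_text <- gsub(";", ",", semi_text, fixed = TRUE)

via_table <- read.table(text = semi_text, header = TRUE, sep = ";")
via_csv   <- read.csv(text = semi_text, sep = ";")   # header defaults to TRUE
via_comma <- read.csv(text = comma_text)             # sep defaults to ","

# All three calls produce the same data set
print(all.equal(via_table, via_csv))
print(all.equal(via_table, via_comma))
```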

read.delim()

The read.delim() function reads a delimited text file into memory. It works like the read.csv() function, except that its default separator is a tab character instead of a comma. It accepts the same parameters as the read.table() function. Here is an example call to the read.delim() function:

data = read.delim("D:\\data\\data.csv", header=T, sep=";")

This example loads the CSV file located at D:\data\data.csv and assigns it to the variable named data. The first line is a header line containing the names of the columns in the CSV file. This is specified by the second parameter header=T. The third parameter overrides the default tab separator and specifies that the separator character used inside the file is ; (a semicolon).
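For a tab separated file with a header line, read.delim() needs no extra parameters, since header=TRUE and sep="\t" are its defaults. Here is a sketch using inline sample data, where \t stands for a tab character. The variable names are made up for this example.

```r
tab_text <- "name\tid\tsalary\nJohn Doe\t1\t99999\nJoe Blocks\t2\t120000"

via_delim <- read.delim(text = tab_text)   # defaults: header = TRUE, sep = "\t"
via_table <- read.table(text = tab_text, header = TRUE, sep = "\t")

print(via_delim)

# Same result as read.table() with the parameters spelled out
print(all.equal(via_delim, via_table))
```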




