Data Science Tutorial
Data science is the process of studying data to gain new knowledge from it. To do so, data scientists analyze the data in many different ways. This analysis typically involves performing calculations on the data, visualizing the data and the calculations in various ways, or both.
Exactly what analysis is performed depends on the context of the data science process. For instance, a data science team studying sales data will perform different calculations than a data science team studying crime patterns.
Extracting Aggregate Information
Quite often, data science projects (or data analysis projects) extract information from a data set that cannot be obtained by looking at any single record in the data set. For instance, if your data set is a list of sales records, you might want to know the count of sales records (the number of sales made) or the sum of the revenue of all sales (in money). You might also be interested in the average amount per sale.
All of these numbers can only be obtained by looking at the whole data set (or a sample of it). Such information is called "aggregate information" because the numbers are aggregated from multiple records in the data set.
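As a minimal sketch of the count, sum and average mentioned above, the following Python snippet aggregates a small list of sales records. The records, the field names (`id`, `amount`) and the `aggregate` function are hypothetical and only serve as illustration.

```python
# Hypothetical sales records; the field names and amounts are made up for illustration.
sales = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": 80.0},
    {"id": 3, "amount": 250.0},
    {"id": 4, "amount": 50.0},
]


def aggregate(records):
    """Return count, total revenue and average amount for a list of sales records."""
    count = len(records)
    total = sum(record["amount"] for record in records)
    # Avoid division by zero for an empty data set.
    average = total / count if count else 0.0
    return {"count": count, "total": total, "average": average}


summary = aggregate(sales)
print(summary)  # {'count': 4, 'total': 500.0, 'average': 125.0}
```

None of the three numbers can be read from a single record. Each one requires a pass over the whole list.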
Extracting Useful Information
You can extract a lot of information from a data set with data science techniques. However, far from all of that information is useful. Anyone who has used Google Analytics for an extended period of time can attest to that. Google Analytics can extract a lot of information from your web traffic, but not all of that information is useful to everyone.
If you extract too much information, you risk information overload, and the useful information gets lost among all the useless information.
Starting With a Question
A good way to focus your data analysis on extracting useful information is to start with one or more questions you want answered by the data. Once you know what questions you want answered, you can narrow your data collection to only contain data that can answer those questions. Once the data is collected you can analyze it and hopefully get your answer.
Starting With a Thesis
Another good way to focus your data science project is to start with a thesis that you want to prove or disprove. A thesis is an assumption you believe to be true. Once you have formulated a thesis you can focus your project on collecting data from experiments which can prove or disprove that thesis.
Data Science Processes
Depending on the nature of your data science project, your project may follow different processes. Here I will briefly describe two similar, albeit not identical, data science processes.
Starting your data science project with questions leads to the following mini process:
- Determine what questions you want answered.
- Determine what data needs to be collected to answer those questions.
- If necessary, specify an experiment that could collect that data.
- Execute experiment and collect the data.
- Analyze the data.
- Obtain answers to the questions.
You may have to cycle through the process a few times before you have your answers.
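The question-driven process above could be sketched in Python roughly as follows. The question ("which product category generated the most revenue?"), the records, and the function names `revenue_by_category` and `top_category` are all assumptions made for this example, not part of any particular tool.

```python
from collections import defaultdict

# Question (hypothetical): which product category generated the most revenue?
# Data needed to answer it: the category and the amount of each sale.
# The records below are invented stand-ins for collected data.
sales = [
    {"category": "books", "amount": 30.0},
    {"category": "games", "amount": 60.0},
    {"category": "books", "amount": 45.0},
    {"category": "music", "amount": 20.0},
]


def revenue_by_category(records):
    """Analyze the data: sum the sale amounts per category."""
    totals = defaultdict(float)
    for record in records:
        totals[record["category"]] += record["amount"]
    return dict(totals)


def top_category(records):
    """Obtain the answer: the category with the highest total revenue."""
    totals = revenue_by_category(records)
    return max(totals, key=totals.get)


totals = revenue_by_category(sales)
answer = top_category(sales)
print(totals)
print(answer)
```

Because the question was fixed first, only two fields per sale had to be collected. Any other data about the sales would not help answer it.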
You can also start a data science project with a thesis. Your data science project then attempts to prove or disprove this thesis (the assumption). This is the process often used in Lean Startup projects, where data science is used to find out what the customers want. However, this process could also be used in standard scientific research projects to prove or disprove a scientific thesis. Starting your data science project with a thesis leads to the following mini process:
- Formulate thesis.
- Determine what data could prove or disprove the thesis.
- Specify an experiment that could collect that data.
- Execute experiment and collect data.
- Analyze the data.
- Prove or disprove thesis.
You may have to cycle through the process a few times before you have proved or disproved your thesis.
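The thesis-driven process above could look something like the following Python sketch. The thesis ("variant B of a signup page converts better than variant A"), the visitor and signup counts, and the `min_lift` threshold are all invented for illustration. Comparing two rates against a fixed threshold is a deliberate simplification. A real project would normally also apply a statistical significance test before treating a thesis as proved or disproved.

```python
# Hypothetical thesis: "Variant B of a signup page converts better than variant A."
# The data that could prove or disprove it: visitors and signups per variant.
# The counts below are invented stand-ins for data collected in an experiment.
experiment = {
    "A": {"visitors": 1000, "signups": 50},
    "B": {"visitors": 1000, "signups": 65},
}


def conversion_rate(group):
    """Fraction of visitors who signed up."""
    return group["signups"] / group["visitors"]


def thesis_supported(data, min_lift=0.01):
    """True if B's conversion rate exceeds A's by at least min_lift.

    min_lift is an arbitrary threshold chosen for this sketch.
    """
    return conversion_rate(data["B"]) - conversion_rate(data["A"]) >= min_lift


rate_a = conversion_rate(experiment["A"])
rate_b = conversion_rate(experiment["B"])
supported = thesis_supported(experiment)
print(rate_a, rate_b, supported)
```

If the data does not support the thesis, you would typically reformulate the thesis or the experiment and cycle through the process again.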
Business Intelligence - Data Science in Business
Business intelligence is a branch of data science used in business. The purpose of business intelligence is to extract information from data that enables a company to improve its business, for instance information that enables the company to increase revenue or decrease costs. The data can be collected by the company itself, or be third party data obtained from research institutes, governments etc. Often a mix of data sources will be used.
Business intelligence varies from very simple to very advanced calculations. Even the smallest of businesses can benefit from a minimum of business intelligence. The benefits of business intelligence are usually biggest (in percentages) in the beginning, when going from no business intelligence to some business intelligence. As your knowledge of the business grows deeper and you have already optimized a lot, there will be less and less potential for optimization.
Data Science Techniques
Even though most data science projects are different, they often use many of the same techniques. These techniques are typically a combination of math and computer science. Here are some of the more commonly used data science techniques:
- Mathematical Analysis
- Data Mining
- Artificial Intelligence
Over time, this and other tutorial trails here on tutorials.jenkov.com will explain these techniques.