Data Science

Python is a top language for data science. Now that you know how to program with Python, you're going to use Python to analyze data. In this course, you will learn to use the most important libraries for analyzing data in Python, including NumPy, Pandas, and Scikit-learn. These Python libraries are integral to becoming a data scientist and are used in industries such as healthcare, finance, insurance, all things Internet, and many others. In addition to analyzing data, you'll learn how to build data pipelines and build your own machine learning model for making predictions on a real-world data set.

In module one, you'll learn how to understand data set characteristics, get an overview of Python packages for analyzing data, and learn how to import data and start analyzing it. In module two, you'll learn data wrangling and preprocessing: dealing with missing values, followed by data formatting and data normalization. In module three, you'll be introduced to exploratory data analysis: descriptive statistics, GroupBy, correlation, and other important statistics. In module four, you will learn linear regression, model evaluation, polynomial regression, pipelines, and measures for in-sample evaluation, prediction, and decision making. In module five, you will discover model evaluation and refinement: overfitting, underfitting, model selection, ridge regression, and grid search. Finally, you will practice your newly acquired skills with a hands-on project using a real-world data set. The only prerequisites for this course are programming with Python and high school math. Good luck and have fun.

2. Python Packages for Data Science
Before doing data analysis in Python, we should first tell you a little bit about the main packages relevant to analysis in Python. A Python library is a collection of functions and methods that allow you to perform many actions without writing the code yourself. Libraries usually contain built-in modules providing different functionalities that you can use directly. There are extensive libraries offering a broad range of facilities. We have divided the Python data analysis libraries into three groups.

2.1 Scientific Computing Libraries (first group)
Pandas offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is a two-dimensional table with column and row labels, called a DataFrame. It is designed to provide easy indexing functionality. The NumPy library uses arrays for its inputs and outputs. It can be extended to objects for matrices, and with minor coding changes, developers can perform fast array processing. SciPy includes functions for some advanced math problems, such as integrals, solving differential equations, and optimization, as well as data visualization.

2.2 Data Visualization Libraries (second group)
Data visualization is the best way to communicate the meaningful results of an analysis to others. These libraries enable you to create graphs, charts, and maps. The Matplotlib package is the most well-known library for data visualization. It is great for making graphs and plots, and the graphs are also highly customizable. Another high-level visualization library is Seaborn, which is based on Matplotlib. It makes it very easy to generate various plots, such as heat maps, time series, and violin plots.

2.3 Machine Learning Algorithm Libraries (third group)
With machine learning algorithms, we're able to develop a model using our data set and obtain predictions, tackling machine learning tasks from basic to complex. Here we introduce two packages. The Scikit-learn library contains tools for statistical modeling, including regression, classification, clustering, and so on. This library is built on NumPy, SciPy, and Matplotlib. Statsmodels is also a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
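To make the third group concrete, here is a minimal sketch of the Scikit-learn fit/predict pattern. The tiny arrays below are invented placeholder values for illustration only, not part of the course's used-car data set.

Code:
# A minimal sketch of the Scikit-learn fit/predict pattern.
# The arrays are invented placeholder data, not the course data set.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature, four samples
y = np.array([2.1, 4.0, 6.2, 7.9])           # target values

model = LinearRegression()     # create the estimator
model.fit(x, y)                # train it on the data
print(model.predict([[5.0]]))  # predict the target for a new sample

The same create/fit/predict pattern applies to most Scikit-learn estimators, which is why the library is introduced here before the modeling modules.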
2.4 Importing and Exporting Data in Python
Data acquisition is the process of loading and reading data into a notebook from various sources. To read any data using Python's Pandas package, there are two important factors to consider: format and file path. Format is the way data is encoded. We can usually tell different encoding schemes by looking at the ending of the file name. Some common formats are CSV, JSON, XLSX, HDF, and so forth. The path tells us where the data is stored. Usually it is stored either on the computer we are using, or online on the internet.

In our case, we found a data set of used cars, obtained from the web address used in the code below. When Jerry entered the web address in his web browser, he saw the raw data file. Each row is one data point, and a large number of properties are associated with each data point. Because the properties are separated from each other by commas, we can guess the data format is CSV, which stands for comma-separated values. At this point, these are just numbers and don't mean much to humans. But once we read in this data, we can try to make more sense out of it.

In Pandas, the read_csv method can read files with columns separated by commas into a Pandas data frame. Reading data in Pandas can be done quickly in three lines: first, import Pandas; then define a variable with the file path; and then use the read_csv method to import the data. However, read_csv assumes the data contains a header. Our data on used cars has no column headers, so we need to tell read_csv not to assign headers by setting header to None.

2.4.1 Three steps to read data with the Pandas library
Code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)

After reading the data set, it is a good idea to look at the data frame to get a better intuition and to ensure that everything occurred the way you expected. Since printing the entire data set may take up too much time and resources, we can save time by looking at just part of it:
• df prints the complete data frame (not recommended for big data frames)
• df.head(n) shows the first n rows of the data frame
• df.tail(n) shows the bottom n rows of the data frame

When we print the first five rows, it seems that the data set was read successfully. We can see that Pandas automatically set the column header as a list of integers, because we set header=None when we read the data. It is difficult to work with a data frame without meaningful column names. However, we can assign column names in Pandas. In our present case, it turned out that we have the column names in a separate file online. We first put the 26 attribute names documented for this data set in a list called headers.

2.4.2 Adding headers
Code:
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors",
    "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height",
    "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore",
    "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df.columns = headers

Then we set df.columns = headers to replace the default integer headers with the list. If we use the head method introduced above to check the data set, we see the correct headers inserted at the top of each column.
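As a quick check, here is a minimal inspection sketch, assuming df and headers have been set up as in the code above:

Code:
# Verify that the data and the new column names loaded as expected.
print(df.head(5))    # first five rows of the data frame
print(df.tail(5))    # last five rows of the data frame
print(df.columns)    # the assigned column names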
2.4.3 Exporting to different formats in Python
At some point, after you've done operations on your data frame, you may want to export your Pandas data frame to a new CSV file. You can do this using the method to_csv. To do this, specify the file path, which includes the file name that you want to write to. For example, if you would like to save the data frame df as automobile.csv on your own computer, you can use the syntax:

Code:
df.to_csv("automobile.csv")

For this course, we will only read and save CSV files. However, Pandas also supports importing and exporting most data file types and data set formats, and the code syntax for reading and saving other formats is very similar to that for CSV files. Each row of this table shows the methods used to read and save files in a different format:

Data format     Read               Save
CSV             pd.read_csv()      df.to_csv()
JSON            pd.read_json()     df.to_json()
Excel (XLSX)    pd.read_excel()    df.to_excel()
SQL             pd.read_sql()      df.to_sql()

2.5 Getting Started Analyzing Data in Python
In this topic we introduce some simple Pandas methods that all data scientists and analysts should know when working with Python, Pandas, and data. At this point, we assume that the data has been loaded. It's time for us to explore the data set.

2.5.1 Basic insights from the data
• Understand your data before you begin any analysis.
• You should check:
o Data types
o Data distribution
o Locate potential issues with the data

Pandas has several built-in methods that can be used to understand the data type of each feature or to look at the distribution of data within the data set. Using these methods gives an overview of the data set and also points out potential issues, such as a feature having the wrong data type, which may need to be resolved later on.

2.5.2 Data type comparison
Why check data types?
• Potential issues and type mismatches
• Compatibility with Python methods

Data has a variety of types. The main types stored in Pandas objects are object, float, int, and datetime. The data type names are somewhat different from those in native Python. This table shows the differences and similarities between them:

Pandas type     Native Python type
object          string (str)
int64           int
float64         float
datetime64      datetime (in the datetime module)

Some are very similar, such as the numeric data types int64 and float64. The object Pandas type functions similarly to a string in Python, save for the change in name, while the datetime64 Pandas type is very useful for handling time series data.

There are two reasons to check data types in a data set. First, Pandas automatically assigns types based on the encoding it detects from the original data table, and for a number of reasons this assignment may be incorrect. For example, it would be awkward if the car price column, which we expect to contain continuous numeric values, were assigned the data type object; it would be more natural for it to have the float type. Jerry may need to manually change the data type to float. The second reason is that checking data types allows an experienced data scientist to see which Python functions can be applied to a specific column. For example, some math functions can only be applied to numerical data; if these functions are applied to non-numerical data, an error may result.

2.5.3 Checking data types in list form
In Pandas we use the dtypes attribute of a data frame to check data types:

Code:
df.dtypes

When dtypes is applied to the data set, the data type of each column is returned in a series. A good data scientist's intuition tells us that most of the data types make sense. The make of cars, for example, consists of names, so this information should be of type object. The last one on the list could be an issue: as bore is a dimension of an engine, we should expect a numerical data type to be used. Instead, the object type is used.
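As a preview of the kind of fix involved, here is a minimal sketch of converting a mis-typed column, assuming Pandas is imported as pd and df is loaded as above, and assuming the bore column was read as object because it contains non-numeric placeholder characters:

Code:
# A minimal sketch of converting a mis-typed column to a numeric dtype.
# errors="coerce" turns any non-numeric placeholder values into NaN
# instead of raising an error.
df["bore"] = pd.to_numeric(df["bore"], errors="coerce")
print(df["bore"].dtype)   # now float64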
In later sections, Jerry will correct these type mismatches across the whole data set. Now we would like to check the statistical summary of each column to learn about the distribution of data in each column. The statistical metrics can tell the data scientist whether mathematical issues exist, such as extreme outliers and large deviations, which may have to be addressed later.
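A minimal sketch of obtaining that statistical summary, assuming df is the data frame loaded earlier:

Code:
# Statistical summary (count, mean, std, min, quartiles, max) of the
# numeric columns; pass include="all" to also summarize object columns.
print(df.describe())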