Here are 5 fundamentals you need to know before you start learning data science (Part 1)
- Matrices and Linear Algebra Fundamentals
Statistics is the science of making informed guesses under uncertainty, which makes it a cornerstone of machine learning and artificial intelligence. It draws on a variety of mathematical approaches to reach conclusions. Linear algebra is one branch of mathematics that is especially useful in data science and machine learning. Most machine learning models can be expressed in matrix form, and data sets themselves are often represented as matrices, typically with one row per observation and one column per feature. Linear algebra is used for data preprocessing, data transformation, and model evaluation.
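To make "a model in matrix form" concrete, here is a minimal sketch that fits a straight line by least squares using NumPy. The toy numbers and the variable names (`X`, `y`, `w`) are made up for illustration; this is one simple example, not the only way to express a model with matrices.

```python
import numpy as np

# Toy data set stored as a matrix: one row per sample, one column per feature.
# The first column of ones lets the model learn an intercept.
X = np.array([
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 3.0],
    [1.0, 4.0],
])

# Targets generated from y = 1 + 2 * x (illustrative values only).
y = np.array([3.0, 5.0, 7.0, 9.0])

# Linear regression in matrix form: find w that minimizes ||X @ w - y||.
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Predictions are a single matrix-vector product.
predictions = X @ w

print("weights (intercept, slope):", np.round(w, 6))
print("predictions:", np.round(predictions, 6))
```

The data set, the model parameters, and the predictions are all plain arrays here, which is why linear algebra shows up at every stage of a project.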
- Database Basics
Data is the most important part of data science. You cannot have data science without data. Knowing how to design, build, and interact with databases is essential for any data scientist who wants to advance their career and broaden their knowledge. You should learn about relational and non-relational (NoSQL) databases, and about SQL, especially join operations.
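Since joins are the part of SQL worth practicing first, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their rows are hypothetical, invented only to show how an inner join differs from a left join.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, for illustration only.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
SELECT c.name, o.amount
FROM customers AS c
INNER JOIN orders AS o ON o.customer_id = c.id
ORDER BY o.id
""").fetchall()

# LEFT JOIN keeps every customer, even those without orders.
left_rows = conn.execute("""
SELECT c.name, COUNT(o.id) AS n_orders
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
""").fetchall()

conn.close()

print(inner_rows)
print(left_rows)
```

In this toy data, Linus has no orders, so he disappears from the inner join but appears in the left join with a count of zero.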
- Data Frames and Series
Pandas, a Python library widely used in data science, stores data in two main structures: the DataFrame (2D) and the Series (1D). A Series is a one-dimensional labeled array that can hold integers, strings, floating-point numbers, and other types. A Series contains a single indexed sequence of values, whereas a DataFrame contains multiple Series as columns that share one index. Put another way, a DataFrame is a collection of Series that you can use to analyze your data. Email me if you need cheat sheets for NumPy and pandas.
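The relationship between the two structures is easier to see in code. The sketch below builds two Series and combines them into a DataFrame; the names, ages, and cities are made-up sample values.

```python
import pandas as pd

# Two Series (1D) that share the same index labels. Values are illustrative.
labels = ["ada", "grace", "linus"]
ages = pd.Series([31, 45, 27], index=labels, name="age")
cities = pd.Series(["London", "New York", "Helsinki"], index=labels, name="city")

# A DataFrame (2D) built from those Series: each Series becomes a column.
people = pd.DataFrame({"age": ages, "city": cities})

# Selecting one column of a DataFrame gives back a Series.
age_column = people["age"]

# Label-based lookup of a single cell: row "grace", column "city".
grace_city = people.loc["grace", "city"]

print(people)
print(type(age_column).__name__, grace_city)
```

Selecting a single column returns a Series again, which is the practical sense in which a DataFrame is a collection of Series.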
- Extract, transform, load (ETL)
Data extraction pulls data from homogeneous or heterogeneous sources. Data transformation cleans the data and converts it into a storage format or structure suitable for querying and analysis. Finally, data loading inserts the data into a final target store, such as an operational data store, data mart, data lake, or data warehouse. JSON, XML, and CSV are three of the most common data formats used in data science.
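The three stages can be sketched end to end with nothing but the Python standard library. Everything specific here is an assumption made for illustration: the inline CSV text stands in for a real source, the `extract`, `transform`, and `load` helpers are hypothetical names, and an in-memory SQLite table stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Stand-in for a real source file; note the messy name and the missing spend.
RAW_CSV = """name,signup_date,spend
 Ada ,2023-01-05,25.0
Grace,2023-02-11,
linus,2023-03-02,15.5
"""


def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a spend value, tidy names, cast types."""
    cleaned = []
    for row in rows:
        if not row["spend"].strip():
            continue  # skip incomplete records
        name = row["name"].strip().title()
        cleaned.append((name, row["signup_date"], float(row["spend"])))
    return cleaned


def load(rows, conn):
    """Load: insert the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, signup_date TEXT, spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

loaded = conn.execute(
    "SELECT name, spend FROM customers ORDER BY name"
).fetchall()
print(loaded)
```

Keeping each stage in its own function makes it easy to swap the source (for example JSON instead of CSV) or the target without touching the cleaning logic.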
- Regular Expressions (RegEx)
A regular expression describes a search pattern. It is used to determine whether that pattern exists in a given sequence of characters (a string) and to find where the pattern occurs in a body of text. Regular expressions support text data manipulation, a prerequisite for data science projects that include text analysis, such as natural language processing (NLP).
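Both uses, checking whether a pattern exists and finding where it occurs, can be shown with Python's built-in `re` module. The sample sentence and the email pattern below are illustrative; the pattern is deliberately simplified and would not cover every valid email address.

```python
import re

# Sample text, made up for this example.
text = "Contact ada@example.com or grace@example.org for details."

# Simplified email pattern (an assumption for illustration, not a full validator).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Does the pattern exist in the text at all?
match = EMAIL_PATTERN.search(text)
found = match is not None

# Where does the first occurrence start and end?
position = match.span() if match else None

# Collect every occurrence in the text.
all_emails = EMAIL_PATTERN.findall(text)

print(found, position)
print(all_emails)
```

`search` answers the yes/no question and gives the position of the first hit, while `findall` is the usual starting point for pulling structured values out of raw text before an NLP step.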