Pandas EDA Cheat Sheet



Pandas is widely regarded as one of the most important Python packages for data science. It provides many functions that make working with data easier. Its fast, flexible, and expressive data structures are designed to make real-world data analysis straightforward.

Pandas Cheat Sheet is a quick guide through the basics of Pandas that you will need to get started on wrangling your data with Python. If you want to begin your data science journey with Pandas, you can use it as a handy reference to deal with the data easily.

This cheat sheet guides you through the basics of the Pandas library, from data structures to I/O, selection, sorting, grouping, and summary statistics.

Key and Imports

We use the following shorthand in the cheat sheet:

  • df: Refers to any Pandas DataFrame object.
  • s: Refers to any Pandas Series object.

The entries below assume that pandas is imported as pd and NumPy as np.
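A minimal setup sketch for the shorthand above. The aliases pd and np are the usual convention; the column names and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# df: a small DataFrame with made-up values (illustrative only)
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# s: a Series, here taken from one column of df
s = df["a"]
```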

Importing Data

  • pd.read_csv(filename): Reads data from a CSV file.
  • pd.read_table(filename): Reads data from a delimited text file (tab-separated by default).
  • pd.read_excel(filename): Reads data from an Excel file.
  • pd.read_sql(query, connection_object): Reads data from a SQL table or database.
  • pd.read_json(json_string): Reads data from a JSON-formatted string, URL, or file. Recent pandas versions expect a literal string to be wrapped in io.StringIO.
  • pd.read_html(url): Parses an HTML URL, string, or file and extracts its tables into a list of DataFrames.
  • pd.read_clipboard(): Takes the contents of the clipboard and passes them to read_csv().
  • pd.DataFrame(dict): Builds a DataFrame from a dict, using the keys as column names and the values (lists) as the data.
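As a small sketch of two of the readers listed above, the example below parses CSV text from an in-memory buffer (so no file on disk is assumed) and builds the same table from a dict. The column names and values are hypothetical.

```python
import io

import pandas as pd

# CSV content held in memory; a real call would usually pass a file path
csv_text = "name,score\nada,0.9\nbob,0.4\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a dict: keys become column names
df_dict = pd.DataFrame({"name": ["ada", "bob"], "score": [0.9, 0.4]})
```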

Exporting data

  • df.to_csv(filename): Writes to a CSV file.
  • df.to_excel(filename): Writes to an Excel file.
  • df.to_sql(table_name, connection_object): Writes to a SQL table.
  • df.to_json(filename): Writes to a file in JSON format.
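A brief sketch of the writers above. It relies on the fact that to_csv() and to_json() return a string when no path is given, which avoids touching the filesystem; the data is made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["ada", "bob"], "score": [0.9, 0.4]})

# With no path argument, the serialized text is returned as a string
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# Reading the CSV text back gives a table of the same shape
round_trip = pd.read_csv(io.StringIO(csv_text))
```

Passing a filename instead, as in the list above, writes the same content to disk.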

Create Test Objects

These are useful for testing code segments.

  • pd.DataFrame(np.random.rand(7,18)): Creates a DataFrame with 7 rows and 18 columns of random floats.
  • pd.Series(my_list): Creates a Series from an iterable my_list.
  • df.index = pd.date_range('1940/1/20', periods=df.shape[0]): Adds a date index with one date per row.
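The three test objects above can be sketched together as follows. The start date is the one from the list, written here in ISO format; the list contents are arbitrary.

```python
import numpy as np
import pandas as pd

# 7 rows x 18 columns of uniform random floats in [0, 1)
df = pd.DataFrame(np.random.rand(7, 18))

# A Series built from a plain Python list
my_list = [3, 1, 2]
s = pd.Series(my_list)

# Replace the default integer index with one daily date per row
df.index = pd.date_range("1940-01-20", periods=df.shape[0])
```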

Viewing/Inspecting Data

  • df.head(n): Returns the first n rows of the DataFrame.
  • df.tail(n): Returns the last n rows of the DataFrame.
  • df.shape: Returns the number of rows and columns as a tuple.
  • df.info(): Prints index, datatype, and memory information.
  • s.value_counts(dropna=False): Returns the unique values and their counts, including missing values.
  • df.apply(pd.Series.value_counts): Returns the unique values and counts for all columns.
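A short inspection sketch using the calls above on a made-up table with a few missing values. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", None],
    "temp": [3.0, 18.5, np.nan, 21.0],
})

first_two = df.head(2)      # first two rows
n_rows, n_cols = df.shape   # (rows, columns)
df.info()                   # prints dtypes, non-null counts, memory usage

# dropna=False keeps the missing value as its own category
city_counts = df["city"].value_counts(dropna=False)
```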

Selection

  • df[col1]: Returns the column with label col1 as a Series.
  • df[[col1, col2]]: Returns the columns as a new DataFrame.
  • s.iloc[0]: Selects by position.
  • s.loc['index_one']: Selects by index label.
  • df.iloc[0,:]: Returns the first row.
  • df.iloc[0,0]: Returns the first element of the first column.
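The selection forms above, sketched on a tiny table. The index labels and column names are placeholders chosen to match the shorthand in the list.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 20, 30], "col2": [1.5, 2.5, 3.5]},
    index=["index_one", "index_two", "index_three"],
)

s = df["col1"]                       # one column as a Series
sub = df[["col1", "col2"]]           # several columns as a DataFrame

first_by_pos = s.iloc[0]             # by integer position
first_by_label = s.loc["index_one"]  # by index label

first_row = df.iloc[0, :]            # first row as a Series
top_left = df.iloc[0, 0]             # first row, first column
```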

Data Cleaning

  • df.columns = ['a','b','c']: Renames the columns.
  • pd.isnull(obj): Checks for null values and returns a Boolean array.
  • pd.notnull(obj): The opposite of pd.isnull().
  • df.dropna(): Drops all rows that contain null values.
  • df.dropna(axis=1): Drops all columns that contain null values.
  • df.dropna(axis=1,thresh=n): Drops all columns that have fewer than n non-null values.
  • df.fillna(x): Replaces all null values with x.
  • s.fillna(s.mean()): Replaces all null values with the mean (the mean can be replaced with almost any function from the Statistics section).
  • s.astype(float): Converts the datatype of the Series to float.
  • s.replace(1, 'one'): Replaces all values equal to 1 with 'one'.
  • s.replace([1,3],['one','three']): Replaces all 1 with 'one' and all 3 with 'three'.
  • df.rename(columns=lambda x: x+1): Renames all columns at once (this particular function assumes numeric column labels).
  • df.rename(columns={'old_name': 'new_name'}): Renames selected columns only.
  • df.set_index('column_one'): Returns a DataFrame that uses column_one as its index.
  • df.rename(index=lambda x: x+1): Renames all index labels at once.
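A compact cleaning sketch that exercises several of the calls above on made-up data. Note how thresh counts non-null values along the axis being dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "old_name": [1.0, np.nan, 3.0],
    "other": [np.nan, np.nan, 6.0],
})

null_mask = pd.isnull(df)                 # Boolean frame, True where missing
rows_kept = df.dropna()                   # rows with no missing values
cols_kept = df.dropna(axis=1, thresh=2)   # columns with at least 2 non-null values

# Fill missing entries of one column with that column's mean
filled = df["old_name"].fillna(df["old_name"].mean())

renamed = df.rename(columns={"old_name": "new_name"})

# Map several values at once; unmatched values are left unchanged
labels = pd.Series([1, 3, 5]).replace([1, 3], ["one", "three"])
```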

Filter, Sort, and Groupby

  • df[df[col] > 0.5]: Returns the rows where column col is greater than 0.5.
  • df[(df[col] > 0.5) & (df[col] < 0.7)]: Returns the rows where 0.5 < col < 0.7.
  • df.sort_values(col1): Sorts the values by col1 in ascending order.
  • df.sort_values(col2,ascending=False): Sorts the values by col2 in descending order.
  • df.sort_values([col1,col2],ascending=[True,False]): Sorts the values by col1 in ascending order, then by col2 in descending order.
  • df.groupby(col1): Returns a groupby object for the values from one column.
  • df.groupby([col1,col2]): Returns a groupby object for the values from multiple columns.
  • df.groupby(col1)[col2].mean(): Returns the mean of the values in col2, grouped by the values in col1.
  • df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean'): Creates a pivot table that groups by col1 and calculates the mean of col2 and col3.
  • df.groupby(col1).agg(np.mean): Calculates the average across all columns for every unique col1 group (recent pandas versions prefer the string 'mean').
  • df.apply(np.mean): Applies the function np.mean() across each column.
  • df.apply(np.max,axis=1): Applies the function np.max() across each row.
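The filter, sort, and group operations above can be sketched on a small made-up table. The column names follow the placeholders in the list, and a lambda stands in for np.max in the row-wise apply.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "a", "b"],
    "col2": [0.2, 0.6, 0.8, 0.9],
    "col3": [1, 2, 3, 4],
})

# Rows where 0.5 < col2 < 0.7; each condition needs its own parentheses
filtered = df[(df["col2"] > 0.5) & (df["col2"] < 0.7)]

# col1 ascending, then col2 descending within each col1 value
ordered = df.sort_values(["col1", "col2"], ascending=[True, False])

# Mean of col2 per col1 group
group_means = df.groupby("col1")["col2"].mean()

# The same idea for two value columns at once
pivot = df.pivot_table(index="col1", values=["col2", "col3"], aggfunc="mean")

# Row-wise maximum over the numeric columns
row_max = df[["col2", "col3"]].apply(lambda row: row.max(), axis=1)
```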

Join/Combine

  • df1.append(df2): Adds the rows of df2 to the end of df1 (columns should be identical). This method was removed in pandas 2.0; use pd.concat([df1, df2]) instead.
  • pd.concat([df1, df2], axis=1): Adds the columns of df2 to the end of df1 (rows should be identical).
  • df1.join(df2,on=col1,how='inner'): SQL-style join of the columns in df1 with the columns in df2, matching the values of col1 in df1 against the index of df2. 'how' can be 'left', 'right', 'outer', or 'inner'.
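A short sketch of the combining operations above, using pd.concat for row-wise stacking. All frame and column names are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["x", "y"], "left_val": [1, 2]})
df2 = pd.DataFrame({"key": ["z"], "left_val": [3]})

# Stack rows of df2 under df1 (columns match)
stacked = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side (row indexes match)
extra = pd.DataFrame({"right_val": [10, 20]})
side_by_side = pd.concat([df1, extra], axis=1)

# join() matches the "key" column of df1 against the index of lookup
lookup = pd.DataFrame({"right_val": [100, 300]}, index=["x", "z"])
joined = df1.join(lookup, on="key", how="inner")
```

With how='inner', only keys present on both sides survive, so the joined result here keeps the single row whose key is "x".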

Statistics

The following statistics methods can also be applied to a Series. In recent pandas versions, methods such as mean() and corr() may need numeric_only=True when the DataFrame contains non-numeric columns.

  • df.describe(): Returns summary statistics for the numerical columns.
  • df.mean(): Returns the mean of each column.
  • df.corr(): Returns the pairwise correlation between the columns in the DataFrame.
  • df.count(): Returns the number of non-null values in each DataFrame column.
  • df.max(): Returns the highest value in each column.
  • df.min(): Returns the lowest value in each column.
  • df.median(): Returns the median of each column.
  • df.std(): Returns the standard deviation of each column.
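A minimal sketch of the summary methods above. The table is made up, and the numeric columns are selected first so that the text column does not interfere with the calculations.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "label": ["a", "b", "c", "d"],
})

# Restrict to numeric columns before computing statistics
numeric = df.select_dtypes("number")

summary = numeric.describe()   # count, mean, std, min, quartiles, max
means = numeric.mean()         # one mean per column
corr = numeric.corr()          # pairwise Pearson correlation
counts = df.count()            # non-null values per column, any dtype
```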