- By Gaurav Masand
- Posted in AI, ChatGPT, Research
ChatGPT for automation of Pandas Dataframes (Python-based)
Before you start
Pandas AI (pandas-ai) works with the Pandas module in Python 3 and with Jupyter Notebook. It is a Python library that integrates generative artificial intelligence capabilities into Pandas, making dataframes conversational. Whether you’re a beginner or an experienced data scientist, PandasAI can help you take your data analysis to the next level.
Introduction:
The field of data science has revolutionized the modern world and continues to grow. Data scientists use a myriad of tools to identify patterns in data and to make decisions. Python, along with its well-known module ‘Pandas’, is routinely used for many of these tasks.
Pandas AI is built on top of Pandas, which is a fast, powerful, and flexible data manipulation library. Pandas is widely used in the data science community for a variety of tasks, including data cleaning, data transformation, and data analysis. Pandas AI extends the capabilities of Pandas by adding generative AI functionality to it.
Pandas AI includes several other features, including data augmentation and data imputation. Data augmentation is the process of adding new data to an existing dataset, while data imputation is the process of filling in missing data in a dataset. These features are useful for data scientists who need to work with incomplete datasets or who want to increase the size of their datasets for testing and analysis.
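To make the two terms concrete, here is a minimal sketch in plain Pandas (not PandasAI's own API) of what imputation and augmentation mean on a tiny, made-up DataFrame. The column names and values are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing temperature reading
df = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp_c": [21.0, np.nan, 19.0, 23.0],
})

# Data imputation: fill the missing value with the column mean
column_mean = df["temp_c"].mean()  # NaN is skipped when computing the mean
imputed = df.assign(temp_c=df["temp_c"].fillna(column_mean))

# Data augmentation: add new rows to the existing dataset
extra_rows = pd.DataFrame({"city": ["E"], "temp_c": [20.0]})
augmented = pd.concat([imputed, extra_rows], ignore_index=True)

print(augmented)
```

Mean imputation is only one of several possible strategies; it is used here because it is the simplest to read.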
Another feature of Pandas AI is its ability to handle time-series data. Time-series data is a type of data that is collected over time, such as stock prices or weather data. Pandas AI can handle time-series data (real-world as well as synthetic data), making it a valuable tool for data scientists who need to work with this type of data.
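As an illustration of what working with time-series data looks like underneath, the following sketch uses plain Pandas on a small synthetic price series. The dates and prices are invented for the example, and the code shows standard Pandas operations rather than anything specific to PandasAI.

```python
import pandas as pd

# Synthetic daily prices (made-up values) indexed by date
dates = pd.date_range("2023-01-01", periods=6, freq="D")
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0, 106.0],
    index=dates,
    name="price",
)

# 3-day rolling average; the first two entries are NaN (window not yet full)
rolling_mean = prices.rolling(window=3).mean()

# Downsample to 2-day buckets and average the prices within each bucket
two_day_mean = prices.resample("2D").mean()

print(rolling_mean)
print(two_day_mean)
```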
Pandas AI can also be used with a variety of machine learning algorithms for data analysis and modeling, including clustering, classification, and regression. These algorithms can be used to analyze data, identify patterns, and make predictions. Pandas AI is designed to be used in conjunction with Pandas and is not a replacement for it.
Requirements
- Python 3
- Pandas
- Pandas AI: pip install pandasai
- API key for the LLM provider (OpenAI for ChatGPT models, or Hugging Face Hub)
In order to set the API key for the LLM (Hugging Face Hub, OpenAI), you need to set the appropriate environment variables. You can do this by copying the .env.example file to .env:
cp .env.example .env
Then, edit the .env file and set the appropriate values. As an alternative, you can also pass the API key directly to the constructor of the LLM:
# OpenAI
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")
# OpenAssistant (import the class from the pandasai.llm package of your installed version)
llm = OpenAssistant(api_token="YOUR_HF_API_KEY")
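If you prefer to keep the key in the .env file, one way to make it visible to Python is to load the file into the process environment before creating the LLM. The sketch below uses only the standard library. The helper `load_env_file` is hypothetical (it is not part of PandasAI), and the variable name `OPENAI_API_KEY` is an assumption about what your setup expects, so check the .env.example file for the actual names.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: read KEY=VALUE lines from a .env file.

    Values are copied into os.environ unless the variable is already set.
    Returns a dict of everything that was parsed from the file.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}

    loaded = {}
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)  # do not overwrite existing values
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    load_env_file()
    # Assumed variable name; adjust to match your .env file
    print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```

Keeping the key out of the source code this way avoids committing it to version control by accident.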
Usage
PandasAI is a tool that works together with Pandas, a popular library for data analysis and manipulation. It adds a new level of interactivity to Pandas, allowing you to ask questions about your data and receive answers in the form of Pandas DataFrames. This means that you can use plain language to ask PandasAI to find specific information in your dataset, and it will provide you with the relevant results.
For example, you could ask PandasAI to find all the rows in a DataFrame where a particular column’s value is greater than 5. The tool will go through your dataset and return a new DataFrame containing only the rows that meet that criterion. This is a great way to quickly extract information from large datasets without having to write complex code or spend hours manually searching through your data. Here is a simple example that asks a plain-language question about a small DataFrame:
import pandas as pd
from pandasai import PandasAI
# Sample DataFrame
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [21400000, 2940000, 2830000, 3870000, 2160000, 1350000, 1780000, 1320000, 516000, 14000000],
"happiness_index": [7.3, 7.2, 6.5, 7.0, 6.0, 6.3, 7.3, 7.3, 5.9, 5.0]
})
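# Instantiate an LLM (with no api_token argument, the key is taken from the environment)
from pandasai.llm.openai import OpenAI
llm = OpenAI()
# Wrap the LLM and ask a question about the DataFrame in plain language
pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')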
The above code will return the following:
0     United States
6            Canada
7         Australia
1    United Kingdom
3           Germany
Name: country, dtype: object
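Because the answer comes from code generated by a language model, it can be worth cross-checking it by hand. The following is a minimal sketch in plain Pandas of the two queries discussed so far: a threshold filter and the five happiest countries. It is not the code PandasAI generates, just one straightforward way to compute the same thing, and the threshold of 7.0 is chosen only for illustration.

```python
import pandas as pd

# Same sample DataFrame as in the example above
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [21400000, 2940000, 2830000, 3870000, 2160000,
            1350000, 1780000, 1320000, 516000, 14000000],
    "happiness_index": [7.3, 7.2, 6.5, 7.0, 6.0, 6.3, 7.3, 7.3, 5.9, 5.0],
})

# Rows where a column's value is greater than a threshold
above_seven = df[df["happiness_index"] > 7.0]

# The five rows with the highest happiness index, reduced to the country names
top5 = df.nlargest(5, "happiness_index")["country"]

print(above_seven)
print(top5)
```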
Of course, you can also ask PandasAI to perform more complex queries. For example, you can ask PandasAI to find the sum of the GDPs of the 2 unhappiest countries:
pandas_ai.run(df, prompt='What is the sum of the GDPs of the 2 unhappiest countries?')
The above code will return the following:
14516000
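The same kind of manual check works for this query. Here is a short sketch in plain Pandas, again an independent calculation rather than PandasAI's generated code, that picks the two rows with the lowest happiness index and adds up their GDPs.

```python
import pandas as pd

# Same sample DataFrame as in the example above
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [21400000, 2940000, 2830000, 3870000, 2160000,
            1350000, 1780000, 1320000, 516000, 14000000],
    "happiness_index": [7.3, 7.2, 6.5, 7.0, 6.0, 6.3, 7.3, 7.3, 5.9, 5.0],
})

# Two rows with the lowest happiness index (China and Japan in this data)
two_unhappiest = df.nsmallest(2, "happiness_index")

# Sum of their GDPs: 14000000 + 516000
total_gdp = int(two_unhappiest["gdp"].sum())
print(total_gdp)  # 14516000
```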
You can also ask PandasAI to draw a graph:
pandas_ai.run(
df,
"Plot the histogram of countries showing for each the gdp, using different colors for each bar",
)
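For comparison, a chart like the one requested can be drawn directly with Matplotlib. The sketch below assumes Matplotlib is installed and draws a bar chart with one bar per country (which is what the prompt describes, even though it says "histogram"). It is an illustration of the expected result, not the code PandasAI produces.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Same sample DataFrame as in the example above
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [21400000, 2940000, 2830000, 3870000, 2160000,
            1350000, 1780000, 1320000, 516000, 14000000],
    "happiness_index": [7.3, 7.2, 6.5, 7.0, 6.0, 6.3, 7.3, 7.3, 5.9, 5.0],
})

# One color per bar, taken from Matplotlib's default color cycle ("C0".."C9")
colors = [f"C{i}" for i in range(len(df))]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(df["country"], df["gdp"], color=colors)
ax.set_xlabel("Country")
ax.set_ylabel("GDP")
ax.set_title("GDP by country")

# Rotate the country names so they do not overlap
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()
```

In a Jupyter Notebook the figure is displayed inline once the backend line is removed; in a script, call `fig.savefig(...)` or `plt.show()` as needed.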