What I Learned Using ChatGPT in Data Science

Photo by the author

There is one lesson I learned from using ChatGPT: it is extremely helpful for data science, but everything it generates must be checked carefully. It is great at some tasks and can perform them very quickly and accurately. For some other tasks it is good enough, though you will need to prompt it several times. There is also one task that ChatGPT does quite badly.

You can attach your dataset to a ChatGPT message, and with a few basic instructions ChatGPT will explore the data for you.

For example, I can download a dataset from this data project. The instructions I provided are as follows:

“Use the attached data to conduct a descriptive statistical analysis. Include the following:

Summarize basic statistics (mean, median, standard deviation, etc.).
Identify missing values and propose strategies to deal with them.”

It returns a summary that looks like this, and it performs the same calculations for each variable.

Age:

Average: 28.79 years
Standard deviation: 6.94 years
Range: 18 to 50 years

There were also no missing values identified in the dataset.

If you also need Python code for these calculations, you can ask it to write it.

To load a dataset, use this code.

aerofit_data = pd.read_csv(file_path)

The basic statistics are:

basic_stats = aerofit_data.describe()

You can check for missing values with this code.

missing_values = aerofit_data.isnull().sum()

Furthermore, I can ask ChatGPT to visualize the distribution of key variables and detect potential outliers and anomalies.

It creates histograms and box plots for the key variables: Age, Income, and Miles. It also detects possible outliers in the distributions of Income and Miles.

Created by Author/ChatGPT

It also interprets the visualizations. It notes that the distribution of income is right-skewed, indicating that most customers have incomes at the lower end of the spectrum, with a smaller number of customers earning much higher incomes. The box plot suggests that there are some outliers at the upper end.

Created by Author/ChatGPT

The same interpretation applies to the Miles distribution: it is right-skewed, with outliers at the upper end.

Given the nature of the data, it suggests that these outliers are not necessarily errors but rather represent a specific customer segment. For the skewed distributions, it suggests transformations (e.g., a log transformation) to normalize the data.
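To make the outlier and skewness points concrete, here is a minimal sketch of the kind of check involved. It assumes the common 1.5 x IQR rule (the one box plots typically use for their whiskers) and runs on made-up income values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed income values (not the real dataset).
income = pd.Series(
    [30000, 32000, 35000, 38000, 40000, 42000, 45000, 48000, 52000, 104000],
    name="Income",
)

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = income[(income < lower_fence) | (income > upper_fence)]

# A log transform compresses the long right tail, which reduces skewness.
log_income = np.log(income)

print("Outliers:", outliers.tolist())
print("Skewness before log:", round(income.skew(), 2))
print("Skewness after log: ", round(log_income.skew(), 2))
```

Whether a flagged value should be dropped or kept is a judgment call; as noted above, high values here may simply be a distinct customer segment.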

Of course, you can also ask it to write Python code for these visualizations.

I can use the same data and ask ChatGPT whether it is suitable for logistic regression. And if not, what should I do to transform it to make it suitable?

The response was that my data is primarily suited to descriptive and inferential analysis of continuous data.

Below are proposed methods for transforming data for logistic regression.

Create a binary outcome: Create a binary outcome by categorizing customers as “high usage” or “low usage” based on a median split of the Usage variable. Alternatively, the binary outcome could be “treadmill purchase” vs. “no treadmill purchase.”

Check multicollinearity: Use a correlation matrix or the Variance Inflation Factor (VIF) to check for multicollinearity. If multicollinearity exists, correlated variables can be removed or combined.

Feature scaling: Although feature scaling is not required to fit a logistic regression model, it can help convergence. This is especially true for features with significantly different scales.

Addressing nonlinearity: Transformations such as log or square root can be used for variables that do not satisfy the linearity assumption.

Sample size: The dataset must be large enough given the number of predictor variables.

Data preprocessing: It suggests handling outliers and categorical variables (via one-hot encoding or label encoding). It also suggests handling missing values, but this doesn't apply to my data, which has none.
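The preparation steps above can be sketched roughly as follows. This is an illustration on made-up rows, not the code ChatGPT produced: the column names, the `HighUsage` target, and the `vif` helper are all assumptions, and VIF is computed by hand with NumPy rather than with a dedicated statistics library.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names are assumptions for illustration.
df = pd.DataFrame({
    "Age": [22, 25, 28, 31, 35, 40, 45, 50],
    "Income": [30000, 35000, 42000, 48000, 55000, 61000, 75000, 90000],
    "Usage": [2, 3, 3, 4, 2, 5, 4, 6],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female", "Male", "Female"],
})

# 1. Binary outcome from a median split: 1 = above-median usage, 0 = otherwise.
df["HighUsage"] = (df["Usage"] > df["Usage"].median()).astype(int)

# 2. One-hot encode the categorical predictor, dropping one level to avoid redundancy.
X = pd.get_dummies(
    df[["Age", "Income", "Gender"]], columns=["Gender"], drop_first=True, dtype=float
).astype(float)


def vif(features: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: VIF per column, 1 / (1 - R^2) from regressing it on the others."""
    values = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(y)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1 - residuals.var() / y.var()
        values[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(values, name="VIF")


# 3. Check multicollinearity; large values signal strongly correlated predictors.
vifs = vif(X)

# 4. Standardize the continuous predictors so they share a comparable scale.
numeric_cols = ["Age", "Income"]
X[numeric_cols] = (X[numeric_cols] - X[numeric_cols].mean()) / X[numeric_cols].std()

print(df["HighUsage"].tolist())
print(vifs.round(1))
```

In this toy data Age and Income rise together, so their VIF values come out large, which is the situation where removing or combining variables would be considered.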

ChatGPT can be helpful at every stage of your data science project. You can ask it to write web scraping code. You can ask it to explore and prepare data for machine learning algorithms. It can also provide code for choosing the right algorithm, splitting the dataset, A/B testing, and algorithms to improve performance. Finally, you can also ask ChatGPT to write code for implementing the model.

While ChatGPT can do all this, it’s quite possible that you won’t immediately get code that does exactly what you want. So you need to understand what you want to achieve and check the code. Chances are you’ll need to ask ChatGPT to fix the code and provide more instructions on what to change.

ChatGPT can assist users who are still learning new programming languages. Imagine you are a SQL master and have no problem solving this job interview question. However, you want to add Python to your skill set. ChatGPT can assist you with this. You can copy the SQL code and ask ChatGPT to convert it to Python.

The SQL code looks like this.

SELECT 
  company, 
  profit 
FROM 
  (
    SELECT 
      *, 
      rank() OVER (
        ORDER BY 
          profit DESC
      ) as rank 
    FROM 
      (
        SELECT 
          company, 
          sum(profits) AS profit 
        FROM 
          forbes_global_2010_2014 
        GROUP BY 
          company
      ) sq
  ) sq2 
WHERE 
  rank <= 3;

ChatGPT wrote Python code based on the SQL code above, but it did not work correctly. So I criticized ChatGPT for not writing correct code and asked it to try again, because the output was not the same as that of the SQL code. It tried again and returned the correct result.

import pandas as pd

# Assuming 'forbes_global_2010_2014' is a pandas DataFrame with columns 'company' and 'profits'
# Since we don't have the actual data, this code assumes that the DataFrame is already loaded

# Step 1: Group by company and sum the profits
company_profits_sum = forbes_global_2010_2014.groupby('company')['profits'].sum().reset_index()

# Step 2: Rank the companies by profit in descending order
# method='min' mirrors SQL RANK(): tied companies share the lowest rank
company_profits_sum['rank'] = company_profits_sum['profits'].rank(ascending=False, method='min')

# Step 3: Filter the top 3 companies
top_companies = company_profits_sum[company_profits_sum['rank'] <= 3]

# Final selection of columns
result = top_companies[['company', 'profits']].sort_values('profits', ascending=False)

result  # This will display the top 3 companies by profit

As you can see, you get a nice comment before each block of code. To improve your learning, you can ask it to explain this code. This will take you even deeper into what each line of code does and how.
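One detail worth verifying in a translation like this is how ties are ranked. SQL's RANK() gives tied rows the lowest rank in their group, which corresponds to `method='min'` in pandas `rank`. The sketch below uses made-up totals (hypothetical companies, not the Forbes data) to show how the `method` argument changes which rows pass a `rank <= 3` filter.

```python
import pandas as pd

# Made-up totals with a tie for third place (hypothetical companies).
totals = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E"],
    "profits": [50.0, 40.0, 30.0, 30.0, 10.0],
})

# method='min': tied rows get the lowest rank in their group, like SQL RANK().
totals["rank_min"] = totals["profits"].rank(ascending=False, method="min")
# method='max': tied rows get the highest rank in their group.
totals["rank_max"] = totals["profits"].rank(ascending=False, method="max")

top3_min = totals.loc[totals["rank_min"] <= 3, "company"].tolist()
top3_max = totals.loc[totals["rank_max"] <= 3, "company"].tolist()

print(top3_min)
print(top3_max)
```

When there are no ties the two methods agree, so a mismatch of this kind can go unnoticed until the data happens to contain one.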

The task ChatGPT is bad at is calculation. When I say bad, I mean really bad! It gets to the point where it uses the right formula and plugs in the right values, but somehow manages to mess up the not-so-complicated calculations.

Look at this. I asked it to solve this problem: “Suppose you roll a fair six-sided die 10 times. What is the probability that you will get two ones, three twos, one three, zero fours, three fives, and one six?”

This is how it calculates the probability.

It goes wrong when calculating the factorials. And it does it with style! It states, completely wrongly, that 2! = 12. It's not; it's 2. How can you mess up a basic calculation like 2x1 = 2? This is really comical!

What's even funnier is that the first time it says 3! = 36, and the second time 3! = 6. Credit where it's due: it was correct at least once.

When I ask it to correct the calculation without further explanation, it recalculates and gets a probability of 0.0001389. I couldn't believe my eyes! It can use the exact same formula and values and get a different result that is still wrong!

I asked it to correct the calculation again, and it finally returned the correct result: 0.0008336. Third time's the charm!
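A calculation like this is easy to verify yourself. The problem fits the multinomial distribution: the number of ways to arrange the outcomes, 10! / (2! 3! 1! 0! 3! 1!), times (1/6) raised to the 10th power. The sketch below is one way to check it with the Python standard library; the function name `multinomial_probability` is my own, not something ChatGPT produced.

```python
from math import factorial, prod


def multinomial_probability(counts, probabilities):
    """Probability of seeing exactly `counts` per category in sum(counts) independent trials."""
    n = sum(counts)
    # Multinomial coefficient: n! / (c1! * c2! * ... * ck!)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)
    return coefficient * prod(p ** c for p, c in zip(probabilities, counts))


# Two ones, three twos, one three, zero fours, three fives, one six.
counts = [2, 3, 1, 0, 3, 1]
probabilities = [1 / 6] * 6

p = multinomial_probability(counts, probabilities)
print(f"{p:.6f}")
```

The factorials are 2! = 2, 3! = 6, 1! = 1, and 0! = 1, so the coefficient is 3,628,800 / 72 = 50,400, and the probability comes out to roughly 0.00083.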

I admit that these mistakes were made by ChatGPT 3.5. I asked ChatGPT 4 the same question, and it gave the correct calculation the first time. You can also use some math plugins to be safe.

The main conclusion from all this is that ChatGPT is a bad master but a very good servant. It can be helpful in writing code, debugging, analyzing and visualizing data. However, never trust it completely and never accept what it writes without checking.

Check the code that it writes and verify the calculations. After all, you are a data scientist and ChatGPT is not! You may not get the desired results from ChatGPT on your first try. But providing more precise instructions and a few tries will probably get you to the desired result.

Nate Rosidi is a data scientist and product strategist. He is also an associate professor of analytics and the founder of StrataScratch, a platform that helps data scientists prepare for interviews with real questions from top companies. Nate writes about the latest job trends, provides interview advice, shares data science projects, and covers all things SQL.
