Practical Statistics for Data Science and Python

A non-significant p-value may simply reflect a small sample. Use statsmodels.stats.power to calculate the required sample size before collecting data.
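As a minimal sketch of that advice, the snippet below uses TTestIndPower from statsmodels.stats.power to solve for the per-group sample size of a two-sample t-test. The effect size (Cohen's d = 0.5), significance level (0.05), and target power (0.80) are illustrative assumptions, not values taken from this article.

```python
import math

from statsmodels.stats.power import TTestIndPower

# Assumed planning values (illustrative only, not from the article's data)
effect_size = 0.5   # expected standardized difference (Cohen's d)
alpha = 0.05        # significance level
power = 0.80        # desired probability of detecting the effect

analysis = TTestIndPower()
# Leaving nobs1 unset tells solve_power to solve for the group size
n_per_group = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0,                # equal group sizes
    alternative="two-sided",
)

# Round up: a fractional observation cannot be collected
n_required = math.ceil(n_per_group)
print(f"Required sample size per group: {n_required}")
```

Smaller assumed effects push the required sample size up quickly, so it is worth trying a few plausible effect sizes before committing to a data collection plan.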

She started with the raw data—a 5GB CSV file. pandas loaded it with a groan.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
df = pd.read_csv("clickstream.csv")
print(df.describe())

The mean time on checkout page was 47 seconds. The median was 12 seconds. That was her first clue. A giant gap between mean and median meant outliers—people who left their laptop open for hours.
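The effect is easy to reproduce on a toy sample. The numbers below are made up for illustration and are not taken from clickstream.csv:

```python
import statistics

# Hypothetical checkout times in seconds; the last user left a tab open for an hour
times = [8, 10, 12, 12, 15, 3600]

mean_time = statistics.mean(times)
median_time = statistics.median(times)

# One extreme value drags the mean far above the typical visit,
# while the median stays put
print(f"mean:   {mean_time:.1f} s")
print(f"median: {median_time:.1f} s")
```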

"So much for 'average user'," she said.

She plotted a histogram using seaborn:

import seaborn as sns
sns.histplot(df['time_on_checkout'], bins=50, log_scale=True)
plt.title("Time on Checkout: Log-normal distribution")

It wasn't a bell curve. It was log-normal. Most users left quickly, but a long tail of ghosts haunted the data.
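One rough way to check a log-normal shape is to compare skewness before and after a log transform: if the data are log-normal, the logged values should look roughly symmetric. The sketch below uses simulated times as a stand-in for df['time_on_checkout']; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stand-in for df['time_on_checkout']; parameters are arbitrary
times = rng.lognormal(mean=2.5, sigma=1.0, size=10_000)

# Strong positive skew signals a long right tail
raw_skew = stats.skew(times)
# Skew near zero after the log transform is consistent with a log-normal shape
log_skew = stats.skew(np.log(times))

print(f"skewness of raw times:  {raw_skew:.2f}")
print(f"skewness of log(times): {log_skew:.2f}")
```

On real data the logged skewness will rarely be exactly zero, so treat this as a quick diagnostic rather than a formal test.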

Her rival, Dr. Marcus Crane, insisted the problem was "price sensitivity." He ran a t-test comparing prices for buyers vs. non-buyers.

buyers = df[df.purchased == 1]['price']
non_buyers = df[df.purchased == 0]['price']
t_stat, p_value = stats.ttest_ind(buyers, non_buyers)
print(f"p-value: {p_value:.5f}")  # 0.32

p > 0.05. Not significant. Marcus was wrong.

Elara had a different hypothesis: users abandoned when the website's JavaScript error count exceeded a threshold. But errors were rare. How do you prove a rare event causes drop-off?

stats.mannwhitneyu(lunch, dinner, alternative='two-sided')

Never skip this.
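The Mann-Whitney U call above uses lunch and dinner without defining them. A minimal runnable sketch, with two made-up samples standing in for whatever those variables held:

```python
from scipy import stats

# Hypothetical samples: the article never defines `lunch` and `dinner`
lunch = [12, 15, 14, 10, 18, 11, 13]
dinner = [22, 25, 19, 30, 24, 21, 27]

# Rank-based test: makes no normality assumption, so it suits skewed data
result = stats.mannwhitneyu(lunch, dinner, alternative="two-sided")
print(f"U = {result.statistic}, p-value = {result.pvalue:.4f}")
```

Because the test works on ranks, it is a reasonable alternative to the t-test when the data have a long tail like the checkout times.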

In data science, we frequently need to validate assumptions. For example: "Does the new web design really generate more sales than the old one?" This is where hypothesis tests come in.
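One possible way to frame the web design question is a chi-square test of independence on purchase counts, sketched below with scipy.stats.chi2_contingency. The counts are hypothetical, and this is only one of several tests that could be used here.

```python
from scipy import stats

# Hypothetical A/B counts: [purchases, no purchases] per design
old_design = [120, 880]
new_design = [150, 850]

# H0: the purchase rate is the same for both designs
chi2, p_value, dof, expected = stats.chi2_contingency([old_design, new_design])

print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 5% level")
else:
    print("Not enough evidence to reject H0")
```

The test is two-sided: it detects a difference in purchase rates but does not say which design is better, so compare the observed rates directly before drawing a conclusion.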

# Quick correlation matrix (numeric columns only)
corr_matrix = df.corr(method='pearson', numeric_only=True)  # use 'spearman' for monotonic but non-linear relationships
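The choice of method matters when a relationship is monotonic but not linear. A small sketch on toy data (the DataFrame and its columns are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: y grows exponentially with x (monotonic but not linear)
toy = pd.DataFrame({"x": np.arange(1, 11)})
toy["y"] = np.exp(toy["x"])

# Pearson measures linear association, Spearman measures rank agreement
pearson_r = toy.corr(method="pearson").loc["x", "y"]
spearman_r = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Spearman works on ranks, so it reports a perfect association here, while Pearson is pulled down by the curvature.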