A non-significant p-value may simply reflect a small sample. Use statsmodels.stats.power to calculate the required sample size before collecting data.
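As a rough illustration of what such a power calculation does, here is a minimal sketch using only the standard library and the normal approximation. The function name `sample_size_per_group` and the effect size of 0.5 are assumptions made for the example, not values from the story.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2,
    where d is the standardized effect size (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


# Hypothetical medium effect (d = 0.5), 5% significance level, 80% power.
n_needed = sample_size_per_group(effect_size=0.5)
print(f"Approximate users needed per group: {n_needed}")
```

With statsmodels installed, `TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)` from `statsmodels.stats.power` solves the same problem using the t distribution, so it returns a slightly larger number than this approximation.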
She started with the raw data—a 5GB CSV file. pandas loaded it with a groan.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
df = pd.read_csv("clickstream.csv")
print(df.describe())
The mean time on checkout page was 47 seconds. The median was 12 seconds. That was her first clue. A giant gap between mean and median meant outliers—people who left their laptop open for hours.
"So much for 'average user'," she said.
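The effect is easy to reproduce on a toy sample. The sketch below uses made-up checkout times (not the story's data) in which a single abandoned session of one hour drags the mean far away from the median.

```python
import statistics

# Hypothetical checkout times in seconds; the last value is one user
# who left the tab open for an hour.
checkout_seconds = [8, 9, 10, 12, 12, 14, 15, 3600]

mean_s = statistics.mean(checkout_seconds)
median_s = statistics.median(checkout_seconds)

print(f"mean:   {mean_s} s")
print(f"median: {median_s} s")
```

One extreme value out of eight is enough to make the mean useless as a description of a typical session, while the median barely notices it.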
She plotted a histogram using seaborn:
import seaborn as sns
sns.histplot(df['time_on_checkout'], bins=50, log_scale=True)
plt.title("Time on Checkout: Log-normal distribution")
It wasn't a bell curve. It was log-normal. Most users left quickly, but a long tail of ghosts haunted the data.
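One way to check the log-normal reading is to take logarithms: if the data are log-normal, the logged values should look roughly symmetric, so their mean and median nearly coincide. The sketch below does this on simulated data; the parameters `mu=2.5` and `sigma=1.5` are assumptions chosen only to give a long right tail, not estimates from the clickstream file.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simulated checkout times from a log-normal distribution (assumed parameters).
raw = [rng.lognormvariate(2.5, 1.5) for _ in range(10_000)]
logged = [math.log(t) for t in raw]

# On the raw scale the long tail pulls the mean well above the median.
raw_ratio = statistics.mean(raw) / statistics.median(raw)

# On the log scale the distribution is symmetric, so the gap nearly vanishes.
log_gap = abs(statistics.mean(logged) - statistics.median(logged))

print(f"raw mean / median ratio: {raw_ratio:.2f}")
print(f"log-scale |mean - median|: {log_gap:.3f}")
```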
Her rival, Dr. Marcus Crane, insisted the problem was "price sensitivity." He ran a t-test comparing prices for buyers vs. non-buyers.
buyers = df[df.purchased == 1]['price']
non_buyers = df[df.purchased == 0]['price']
t_stat, p_value = stats.ttest_ind(buyers, non_buyers)
print(f"p-value: {p_value:.5f}") # 0.32
p > 0.05. Not significant. Marcus was wrong.
Elara had a different hypothesis: users abandoned when the website's JavaScript error count exceeded a threshold. But errors were rare. How do you prove a rare event causes drop-off?
stats.mannwhitneyu(lunch, dinner, alternative='two-sided')
Never skip this.
In data science, we frequently need to validate assumptions. For example: "Does the new web design really generate more sales than the old one?" This is where hypothesis tests come in.
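For the web design question, one common choice is a two-proportion z-test on conversion rates. The sketch below is a minimal standard-library version; the visitor and purchase counts are invented for illustration, and the helper name `two_proportion_z_test` is hypothetical.

```python
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both designs convert equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: old design 120 sales from 2400 visitors,
# new design 156 sales from 2400 visitors.
z, p_value = two_proportion_z_test(120, 2400, 156, 2400)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The test only says whether a difference this large would be surprising if both designs converted at the same rate; it says nothing about whether the difference is large enough to matter commercially.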
# Quick correlation matrix (numeric columns only)
corr_matrix = df.corr(method='pearson', numeric_only=True) # use 'spearman' for monotonic but non-linear relationships
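To see why the choice of method matters, the sketch below compares the two coefficients on a perfectly monotonic but non-linear relationship (y = x cubed). It is a hand-rolled, standard-library illustration on invented data; the `ranks` helper assumes there are no tied values.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


def ranks(values):
    """Ranks from 1 to n. Assumes no ties, which holds for this toy data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic, but far from a straight line

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Pearson falls short of 1 because the points do not lie on a line, while Spearman reports a perfect rank association.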