Claude-skill-registry autoviz
Automatic exploratory data analysis and visualization with a single line of code - generates comprehensive charts, detects patterns, and exports to HTML/notebooks
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/autoviz" ~/.claude/skills/majiayu000-claude-skill-registry-autoviz && rm -rf "$T"
AutoViz Automatic EDA Skill
Master AutoViz for instant exploratory data analysis with a single line of code. Generate comprehensive visualizations, detect patterns, identify outliers, and export publication-ready charts automatically.
When to Use This Skill
USE AutoViz when:
- Quick EDA - Need rapid insights into a new dataset
- Initial exploration - Starting analysis on unfamiliar data
- Pattern discovery - Automatically detect relationships between variables
- Presentation prep - Need charts quickly for stakeholder meetings
- Large datasets - Built-in sampling handles big data efficiently
- Feature analysis - Understanding distribution and importance of features
- Correlation hunting - Finding relationships without manual chart creation
- Report generation - Export comprehensive HTML reports
DON'T USE AutoViz when:
- Custom visualizations - Need highly specific chart designs
- Interactive dashboards - Use Streamlit or Dash instead
- Real-time data - Streaming visualization requirements
- Production systems - Charts for automated pipelines (use Plotly/Altair)
- Precise statistical tests - Need formal hypothesis testing
- Domain-specific plots - Specialized visualizations not in standard EDA
Prerequisites
# Basic installation
pip install autoviz

# With all visualization backends
pip install autoviz matplotlib seaborn plotly bokeh

# Using uv (recommended)
uv pip install autoviz pandas matplotlib seaborn plotly

# Jupyter notebook support
pip install autoviz ipywidgets notebook

# Verify installation
python -c "from autoviz import AutoViz_Class; print('AutoViz ready!')"
Core Capabilities
1. Basic One-Line EDA
Simplest Usage:
from autoviz import AutoViz_Class

# Initialize AutoViz
AV = AutoViz_Class()

# Automatic visualization with one line
# Returns a dataframe and generates all charts
df_analyzed = AV.AutoViz(
    filename="data.csv",
    sep=",",
    depVar="",                  # Target variable (optional)
    dfte=None,                  # Pass DataFrame directly instead of filename
    header=0,
    verbose=1,                  # Per the AutoViz README: 0=silent, 1=messages plus displayed charts,
                                # 2=messages with charts saved to disk instead of displayed
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30
)

print(f"Analyzed {df_analyzed.shape[0]} rows, {df_analyzed.shape[1]} columns")
From DataFrame:
from autoviz import AutoViz_Class
import pandas as pd

# Load your data
df = pd.read_csv("sales_data.csv")

# Or create sample data
df = pd.DataFrame({
    "revenue": [100, 200, 150, 300, 250, 400, 350, 500],
    "units": [10, 20, 15, 30, 25, 40, 35, 50],
    "category": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "region": ["North", "South", "East", "West", "North", "South", "East", "West"],
    "profit": [20, 40, 30, 60, 50, 80, 70, 100],
    "customer_age": [25, 35, 45, 55, 30, 40, 50, 60]
})

# Initialize and visualize
AV = AutoViz_Class()

# Pass DataFrame directly using dfte parameter
df_result = AV.AutoViz(
    filename="",        # Empty when using dfte
    sep=",",
    depVar="profit",    # Optional: specify target variable
    dfte=df,
    header=0,
    verbose=1,
    chart_format="png"
)
With Target Variable Analysis:
from autoviz import AutoViz_Class
import pandas as pd

# Classification dataset
df_classification = pd.DataFrame({
    "feature_1": [1.2, 2.3, 1.5, 3.4, 2.1, 4.5, 3.2, 5.1],
    "feature_2": [0.5, 1.2, 0.8, 2.1, 1.0, 3.2, 2.4, 4.0],
    "feature_3": ["low", "medium", "low", "high", "medium", "high", "medium", "high"],
    "target": [0, 0, 0, 1, 0, 1, 1, 1]
})

AV = AutoViz_Class()

# Specify target variable for focused analysis
df_analyzed = AV.AutoViz(
    filename="",
    sep=",",
    depVar="target",    # Target variable for classification
    dfte=df_classification,
    header=0,
    verbose=2,          # Per the AutoViz README: print messages and save charts instead of displaying them
    chart_format="svg"
)

# Regression dataset
df_regression = pd.DataFrame({
    "size": [1000, 1500, 1200, 2000, 1800, 2500, 2200, 3000],
    "bedrooms": [2, 3, 2, 4, 3, 4, 4, 5],
    "location": ["urban", "suburban", "urban", "rural", "suburban", "rural", "suburban", "rural"],
    "age": [5, 10, 3, 15, 8, 20, 12, 25],
    "price": [200000, 280000, 220000, 350000, 300000, 380000, 340000, 420000]
})

# Analyze with continuous target
df_analyzed = AV.AutoViz(
    filename="",
    sep=",",
    depVar="price",     # Continuous target
    dfte=df_regression,
    header=0,
    verbose=1,
    chart_format="png"
)
2. Chart Format and Output Options
Different Chart Formats:
from autoviz import AutoViz_Class
import pandas as pd

df = pd.read_csv("data.csv")
AV = AutoViz_Class()

# SVG format (vector, scalable)
df_svg = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="svg",     # Scalable vector graphics
    verbose=1
)

# PNG format (raster, good for presentations)
df_png = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="png",     # PNG images
    verbose=1
)

# HTML format (interactive Bokeh charts saved as HTML files)
df_html = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="html",    # Interactive HTML
    verbose=1
)

# Bokeh backend for interactive plots inside Jupyter notebooks
df_bokeh = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="bokeh",   # Bokeh interactive
    verbose=1
)

# Server mode (per the AutoViz README, dashboards open in the browser,
# one per chart type, rather than inline in the notebook)
df_server = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="server",  # Browser dashboards
    verbose=1
)
Saving Charts to Directory:
from autoviz import AutoViz_Class
import pandas as pd
import os

# Create output directory
output_dir = "analysis_output"
os.makedirs(output_dir, exist_ok=True)

df = pd.read_csv("data.csv")
AV = AutoViz_Class()

# Save all charts to specified directory
# Per the AutoViz README, verbose=2 saves Matplotlib charts to disk instead of displaying them
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="png",
    save_plot_dir=output_dir,   # Directory to save plots
    verbose=2
)

# List generated files (charts may be written to a subfolder of output_dir)
for root, _dirs, files in os.walk(output_dir):
    for file in files:
        print(f"Generated: {os.path.join(root, file)}")
3. Handling Large Datasets
Sampling Strategies:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np

# Create large dataset
np.random.seed(42)
large_df = pd.DataFrame({
    "feature_" + str(i): np.random.randn(500000)
    for i in range(20)
})
large_df["category"] = np.random.choice(["A", "B", "C", "D"], 500000)
large_df["target"] = np.random.randint(0, 2, 500000)

print(f"Dataset size: {large_df.shape}")

AV = AutoViz_Class()

# Control sampling with max_rows_analyzed
df_analyzed = AV.AutoViz(
    filename="",
    dfte=large_df,
    max_rows_analyzed=100000,   # Sample 100K rows
    max_cols_analyzed=25,       # Limit columns analyzed
    verbose=1,
    chart_format="png"
)

# For very large datasets, use smaller sample
df_analyzed_small = AV.AutoViz(
    filename="",
    dfte=large_df,
    max_rows_analyzed=50000,    # Smaller sample for speed
    max_cols_analyzed=15,
    verbose=0,                  # Minimal output
    chart_format="svg"
)
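When the target classes are imbalanced, it can help to downsample before calling AutoViz in a way that keeps each class's share intact. The sketch below shows stratified sampling with only the standard library. `stratified_sample` is a hypothetical helper, not part of AutoViz, and it assumes rows are held as a list of dicts.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=42):
    """Sample `fraction` of `rows` within each group defined by `key`.

    Hypothetical helper: `rows` is a list of dicts and `key` names the
    grouping column. Every non-empty group keeps at least one row.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for members in groups.values():
        # Keep a proportional share of each group, but never zero rows
        keep = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, keep))
    return sample


# 1000 rows with a 10% positive class
rows = [{"id": i, "target": 1 if i % 10 == 0 else 0} for i in range(1000)]
subset = stratified_sample(rows, key="target", fraction=0.1)
positives = sum(r["target"] for r in subset)
print(len(subset), positives)
```

With a pandas DataFrame, `df.groupby("target").sample(frac=0.1, random_state=42)` should give a comparable result. The sampled frame can then be passed through `dfte` as in the examples above.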
Memory-Efficient Analysis:
from autoviz import AutoViz_Class
import numpy as np
import pandas as pd


def analyze_large_file(file_path: str, sample_size: int = 100000) -> pd.DataFrame:
    """
    Analyze large files efficiently with sampling.

    Args:
        file_path: Path to CSV file
        sample_size: Approximate number of rows to sample

    Returns:
        Analyzed DataFrame
    """
    # Count data rows without loading the file into memory
    with open(file_path) as f:
        total_rows = sum(1 for _ in f) - 1  # Exclude header

    if total_rows > sample_size:
        # Probability of skipping any given data row
        skip_prob = 1 - (sample_size / total_rows)

        # Read with sampling; row 0 (the header) is always kept
        df = pd.read_csv(
            file_path,
            skiprows=lambda i: i > 0 and np.random.random() < skip_prob
        )
    else:
        df = pd.read_csv(file_path)

    print(f"Sampled {len(df)} rows from {total_rows} total")

    AV = AutoViz_Class()
    return AV.AutoViz(
        filename="",
        dfte=df,
        verbose=1,
        chart_format="png"
    )


# Usage
# df_result = analyze_large_file("huge_dataset.csv", sample_size=75000)
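The probabilistic skip above returns roughly, not exactly, `sample_size` rows, and it needs a first pass to count lines. If an exact sample size matters, reservoir sampling is one alternative: it reads the file once and keeps a uniform random sample of fixed size. A minimal standard-library sketch follows. `reservoir_sample_csv` is a hypothetical helper name, not an AutoViz or pandas function.

```python
import csv
import random


def reservoir_sample_csv(file_path, sample_size, seed=42):
    """Return (header, rows) with at most `sample_size` uniformly sampled rows.

    Hypothetical helper: one pass over the file, memory bounded by `sample_size`.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(file_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        for index, row in enumerate(reader):
            if index < sample_size:
                # Fill the reservoir with the first `sample_size` rows
                reservoir.append(row)
            else:
                # Keep the new row with probability sample_size / (index + 1)
                slot = rng.randint(0, index)
                if slot < sample_size:
                    reservoir[slot] = row
    return header, reservoir
```

The result can be turned into a DataFrame with `pd.DataFrame(rows, columns=header)`. Note that the `csv` module yields strings, so numeric columns would need converting (for example with `pd.to_numeric`) before the frame is passed to AutoViz.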
4. Feature Analysis and Distribution Plots
Understanding Feature Distributions:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np

# Create dataset with various distributions
np.random.seed(42)

df = pd.DataFrame({
    # Normal distribution
    "normal": np.random.normal(100, 15, 1000),

    # Skewed distribution
    "skewed": np.random.exponential(50, 1000),

    # Bimodal distribution
    "bimodal": np.concatenate([
        np.random.normal(30, 5, 500),
        np.random.normal(70, 5, 500)
    ]),

    # Uniform distribution
    "uniform": np.random.uniform(0, 100, 1000),

    # Categorical with different frequencies
    "category_balanced": np.random.choice(["A", "B", "C"], 1000),
    "category_imbalanced": np.random.choice(
        ["Common", "Rare", "Very Rare"],
        1000,
        p=[0.8, 0.15, 0.05]
    ),

    # Target variable
    "target": np.random.choice([0, 1], 1000, p=[0.7, 0.3])
})

AV = AutoViz_Class()

# AutoViz will automatically:
# 1. Detect distribution types
# 2. Create appropriate histograms
# 3. Show box plots for numerical features
# 4. Create bar charts for categorical features
# 5. Highlight potential outliers
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    depVar="target",
    verbose=2,
    chart_format="svg"
)
Categorical Feature Analysis:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np

# Dataset with multiple categorical features
df = pd.DataFrame({
    "product_category": np.random.choice(
        ["Electronics", "Clothing", "Food", "Home", "Sports"], 1000
    ),
    "customer_segment": np.random.choice(
        ["Premium", "Standard", "Budget"], 1000, p=[0.2, 0.5, 0.3]
    ),
    "region": np.random.choice(
        ["North", "South", "East", "West"], 1000
    ),
    "channel": np.random.choice(
        ["Online", "Store", "Mobile"], 1000
    ),
    "revenue": np.random.exponential(500, 1000),
    "quantity": np.random.randint(1, 20, 1000)
})

AV = AutoViz_Class()

# AutoViz creates:
# - Bar charts for each categorical variable
# - Cross-tabulation visualizations
# - Category vs numerical variable plots
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    depVar="revenue",
    verbose=1,
    chart_format="png"
)
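A cross-tabulation, as mentioned in the comments above, is a count of how often each pair of category values occurs together. The sketch below computes one with the standard library so the numbers behind such a chart are easy to inspect. `crosstab` is a hypothetical helper and the sample rows are made up for illustration.

```python
from collections import Counter


def crosstab(rows, row_key, col_key):
    """Count co-occurrences of two categorical fields in a list of dicts."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_labels = sorted({pair[0] for pair in counts})
    col_labels = sorted({pair[1] for pair in counts})
    # A Counter returns 0 for pairs that never occur
    return {rl: {cl: counts[(rl, cl)] for cl in col_labels} for rl in row_labels}


rows = [
    {"segment": "Premium", "channel": "Online"},
    {"segment": "Premium", "channel": "Store"},
    {"segment": "Budget", "channel": "Online"},
    {"segment": "Budget", "channel": "Online"},
    {"segment": "Standard", "channel": "Mobile"},
]
table = crosstab(rows, "segment", "channel")
for label, cells in table.items():
    print(label, cells)
```

With pandas, `pd.crosstab(df["customer_segment"], df["channel"])` produces the same kind of table directly from a DataFrame.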
5. Correlation Detection
Automatic Correlation Analysis:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np

# Create dataset with known correlations
np.random.seed(42)
n = 1000

# Base variables
x1 = np.random.randn(n)
x2 = np.random.randn(n)

df = pd.DataFrame({
    "x1": x1,
    "x2": x2,

    # Strongly correlated with x1
    "y1": x1 * 2 + np.random.randn(n) * 0.5,

    # Moderately correlated with x2
    "y2": x2 + np.random.randn(n) * 1.5,

    # Negatively correlated
    "y3": -x1 + np.random.randn(n) * 0.8,

    # No correlation
    "y4": np.random.randn(n),

    # Non-linear relationship
    "y5": x1 ** 2 + np.random.randn(n) * 0.5,

    # Target
    "target": (x1 + x2 > 0).astype(int)
})

AV = AutoViz_Class()

# AutoViz generates:
# 1. Correlation heatmap
# 2. Scatter plots for highly correlated pairs
# 3. Pair plots for feature relationships
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    depVar="target",
    verbose=2,
    chart_format="svg"
)
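One caveat when reading a correlation heatmap: assuming it is based on linear (Pearson) correlation, as is typical, a relationship like `y5` above can show a coefficient near zero even though the variable depends strongly on `x1`. The small standard-library example below illustrates this with a noise-free quadratic. The `pearson` function is written out here for illustration and is not an AutoViz API.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]    # perfectly linear in x
quadratic = [x ** 2 for x in xs]    # fully determined by x, but not linear

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}")
```

The linear pair yields a coefficient of 1, while the quadratic pair yields 0 because the positive and negative halves cancel. This is why the scatter plots are worth checking alongside the heatmap.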
Correlation with Lowess Smoothing:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np

# Dataset with non-linear relationships
np.random.seed(42)
x = np.linspace(0, 10, 500)

df = pd.DataFrame({
    "x": x,
    "linear": 2 * x + np.random.randn(500) * 2,
    "quadratic": x ** 2 + np.random.randn(500) * 5,
    "sinusoidal": 10 * np.sin(x) + np.random.randn(500) * 2,
    "logarithmic": 5 * np.log(x + 1) + np.random.randn(500),
    "target": x + np.random.randn(500)
})

AV = AutoViz_Class()

# Enable lowess smoothing to see trends
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    depVar="target",
    lowess=True,    # Enable lowess smoothing
    verbose=1,
    chart_format="png"
)
6. Outlier Detection and Highlighting
Automatic Outlier Identification:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np

# Create dataset with outliers
np.random.seed(42)
n = 1000

# Normal data with injected outliers
revenue = np.concatenate([
    np.random.normal(1000, 200, n - 20),    # Normal values
    np.random.uniform(3000, 5000, 10),      # High outliers
    np.random.uniform(-500, 0, 10)          # Low outliers
])

units = np.concatenate([
    np.random.normal(50, 10, n - 15),
    np.random.uniform(150, 200, 15)         # Outliers
])

df = pd.DataFrame({
    "revenue": revenue,
    "units": units,
    "cost": np.abs(revenue * 0.6 + np.random.randn(n) * 100),
    "category": np.random.choice(["A", "B", "C"], n),
    "region": np.random.choice(["North", "South", "East", "West"], n)
})

AV = AutoViz_Class()

# AutoViz automatically:
# 1. Detects outliers using IQR method
# 2. Highlights them in box plots
# 3. Shows them in scatter plots
# 4. Reports outlier counts
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    verbose=2,
    chart_format="svg"
)
Custom Outlier Analysis Wrapper:
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np


def analyze_with_outlier_report(df: pd.DataFrame, target: str = "") -> dict:
    """
    Run AutoViz and provide detailed outlier report.

    Args:
        df: Input DataFrame
        target: Target variable name (optional)

    Returns:
        Dictionary with analysis results and outlier info
    """
    # Calculate outliers before visualization
    outlier_info = {}
    numeric_cols = df.select_dtypes(include=[np.number]).columns

    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

        outlier_info[col] = {
            "count": len(outliers),
            "percentage": len(outliers) / len(df) * 100,
            "lower_bound": lower_bound,
            "upper_bound": upper_bound,
            "min_outlier": outliers[col].min() if len(outliers) > 0 else None,
            "max_outlier": outliers[col].max() if len(outliers) > 0 else None
        }

    # Run AutoViz
    AV = AutoViz_Class()
    df_analyzed = AV.AutoViz(
        filename="",
        dfte=df,
        depVar=target,
        verbose=1,
        chart_format="png"
    )

    return {
        "analyzed_df": df_analyzed,
        "outlier_report": outlier_info,
        "total_outliers": sum(info["count"] for info in outlier_info.values())
    }


# Usage
# result = analyze_with_outlier_report(df, target="revenue")
# print(f"Total outliers found: {result['total_outliers']}")
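To make the 1.5 × IQR rule used in the wrapper concrete, here is a small worked example on eight values using only the standard library. `iqr_outliers` is a hypothetical helper. `method="inclusive"` is chosen because it interpolates linearly between data points, which should match the default behaviour of pandas `quantile`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    # quantiles() with n=4 returns the three quartile cut points
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


values = [10, 12, 13, 14, 15, 16, 18, 95]
lower, upper, outliers = iqr_outliers(values)
print(lower, upper, outliers)
```

For this list, Q1 is 12.75 and Q3 is 16.5, so the IQR is 3.75 and the fences work out to 7.125 and 22.125. Only the value 95 falls outside them.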
7. Export to HTML and Notebooks
HTML Report Generation:
from autoviz import AutoViz_Class
import pandas as pd
import os


def generate_html_report(
    df: pd.DataFrame,
    output_dir: str,
    report_name: str = "eda_report",
    target: str = ""
) -> str:
    """
    Generate comprehensive HTML report with AutoViz.

    Args:
        df: Input DataFrame
        output_dir: Directory for output files
        report_name: Name for the report
        target: Target variable (optional)

    Returns:
        Path to generated report
    """
    os.makedirs(output_dir, exist_ok=True)

    AV = AutoViz_Class()

    # Generate HTML charts
    df_analyzed = AV.AutoViz(
        filename="",
        dfte=df,
        depVar=target,
        chart_format="html",
        save_plot_dir=output_dir,
        verbose=1
    )

    # Create summary HTML
    html_content = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>{report_name} - AutoViz EDA Report</title>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 20px; }}
            h1 {{ color: #333; }}
            .summary {{ background: #f5f5f5; padding: 15px; border-radius: 5px; }}
            .chart-container {{ margin: 20px 0; }}
        </style>
    </head>
    <body>
        <h1>{report_name}</h1>
        <div class="summary">
            <h2>Dataset Summary</h2>
            <p>Rows: {len(df):,}</p>
            <p>Columns: {len(df.columns)}</p>
            <p>Numeric columns: {len(df.select_dtypes(include=['number']).columns)}</p>
            <p>Categorical columns: {len(df.select_dtypes(include=['object', 'category']).columns)}</p>
            <p>Target variable: {target if target else 'Not specified'}</p>
        </div>
        <h2>Column Information</h2>
        <table border="1" style="border-collapse: collapse;">
            <tr><th>Column</th><th>Type</th><th>Non-Null</th><th>Unique</th></tr>
    """

    for col in df.columns:
        html_content += f"""
            <tr>
                <td>{col}</td>
                <td>{df[col].dtype}</td>
                <td>{df[col].notna().sum()}</td>
                <td>{df[col].nunique()}</td>
            </tr>
        """

    html_content += """
        </table>
        <h2>Generated Charts</h2>
        <p>Charts have been saved to the output directory.</p>
    </body>
    </html>
    """

    report_path = os.path.join(output_dir, f"{report_name}.html")
    with open(report_path, "w") as f:
        f.write(html_content)

    return report_path


# Usage
# report_path = generate_html_report(df, "output/eda", "sales_analysis", "revenue")
# print(f"Report saved to: {report_path}")
Jupyter Notebook Integration:
# In Jupyter Notebook
from autoviz import AutoViz_Class
import pandas as pd

# Load data
df = pd.read_csv("data.csv")

# Initialize AutoViz
AV = AutoViz_Class()

# Static Matplotlib formats (svg, png) render inline in notebooks
%matplotlib inline

df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    depVar="target",
    chart_format="svg",     # Display inline in notebook
    verbose=1
)

# Alternative: Use bokeh for interactive plots in notebooks
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    depVar="target",
    chart_format="bokeh",   # Interactive Bokeh plots
    verbose=1
)

# Note: per the AutoViz README, chart_format="server" opens dashboards
# in the browser rather than rendering inline
Export to Notebook File:
from autoviz import AutoViz_Class
import pandas as pd
import nbformat as nbf


def create_eda_notebook(
    df: pd.DataFrame,
    output_path: str,
    dataset_name: str = "dataset"
) -> str:
    """
    Create a Jupyter notebook with AutoViz EDA.

    Args:
        df: Input DataFrame
        output_path: Path for output notebook
        dataset_name: Name for the dataset

    Returns:
        Path to created notebook
    """
    nb = nbf.v4.new_notebook()

    # Cell sources start at column 0 so the generated notebook code is not indented
    cells = [
        nbf.v4.new_markdown_cell(f"# Exploratory Data Analysis: {dataset_name}"),

        nbf.v4.new_code_cell("""
from autoviz import AutoViz_Class
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
"""),

        nbf.v4.new_markdown_cell("## Load Data"),

        nbf.v4.new_code_cell(f"""
# Load the dataset
df = pd.read_csv("{dataset_name}.csv")  # Update path as needed
print(f"Dataset shape: {{df.shape}}")
df.head()
"""),

        nbf.v4.new_markdown_cell("## AutoViz Analysis"),

        # svg renders inline in notebooks; "server" would open browser dashboards instead
        nbf.v4.new_code_cell("""
AV = AutoViz_Class()
df_analyzed = AV.AutoViz(
    filename="",
    dfte=df,
    chart_format="svg",
    verbose=1
)
"""),

        nbf.v4.new_markdown_cell("## Summary Statistics"),

        nbf.v4.new_code_cell("""
df.describe()
"""),

        nbf.v4.new_markdown_cell("## Missing Values"),

        nbf.v4.new_code_cell("""
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)
""")
    ]

    nb.cells = cells

    with open(output_path, "w") as f:
        nbf.write(nb, f)

    return output_path


# Usage
# notebook_path = create_eda_notebook(df, "eda_analysis.ipynb", "sales_data")
Complete Examples
Example 1: Sales Data EDA Pipeline
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np
from datetime import datetime
import os


def sales_eda_pipeline(
    data_path: str,
    output_dir: str,
    target_column: str = "revenue"
) -> dict:
    """
    Complete EDA pipeline for sales data using AutoViz.

    Args:
        data_path: Path to sales data CSV
        output_dir: Directory for output files
        target_column: Target variable for analysis

    Returns:
        Dictionary with analysis results
    """
    os.makedirs(output_dir, exist_ok=True)

    # Load data
    print("Loading data...")
    df = pd.read_csv(data_path)

    # Basic data info
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")

    # Data type summary
    dtype_summary = df.dtypes.value_counts()
    print(f"\nData types:\n{dtype_summary}")

    # Missing values
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_df = pd.DataFrame({
        "missing_count": missing,
        "missing_pct": missing_pct
    })
    missing_df = missing_df[missing_df["missing_count"] > 0]

    if len(missing_df) > 0:
        print(f"\nMissing values:\n{missing_df}")
    else:
        print("\nNo missing values found")

    # Run AutoViz
    # Per the AutoViz README, verbose=2 saves charts to disk instead of displaying them
    print("\nRunning AutoViz analysis...")
    AV = AutoViz_Class()

    df_analyzed = AV.AutoViz(
        filename="",
        dfte=df,
        depVar=target_column if target_column in df.columns else "",
        chart_format="png",
        save_plot_dir=output_dir,
        max_rows_analyzed=100000,
        verbose=2
    )

    # Calculate additional statistics
    numeric_cols = df.select_dtypes(include=[np.number]).columns

    stats = {
        "shape": df.shape,
        "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
        "missing_values": missing.sum(),
        "numeric_columns": len(numeric_cols),
        "categorical_columns": len(df.columns) - len(numeric_cols)
    }

    if target_column in df.columns:
        target_stats = df[target_column].describe().to_dict()
        stats["target_stats"] = target_stats

    # Save summary
    summary_path = os.path.join(output_dir, "eda_summary.txt")
    with open(summary_path, "w") as f:
        f.write(f"EDA Summary - {datetime.now()}\n")
        f.write("=" * 50 + "\n\n")
        f.write(f"Dataset: {data_path}\n")
        f.write(f"Shape: {df.shape}\n")
        f.write(f"Memory: {stats['memory_mb']:.2f} MB\n\n")
        f.write("Columns:\n")
        for col in df.columns:
            f.write(f"  - {col}: {df[col].dtype}\n")

    print(f"\nAnalysis complete! Results saved to: {output_dir}")

    return {
        "dataframe": df_analyzed,
        "statistics": stats,
        "output_dir": output_dir
    }


# Generate sample data for testing
def generate_sample_sales_data(n_rows: int = 10000) -> pd.DataFrame:
    """Generate sample sales data for testing."""
    np.random.seed(42)

    dates = pd.date_range(
        start="2024-01-01",
        end="2025-12-31",
        periods=n_rows
    )

    return pd.DataFrame({
        "date": dates,
        "product_id": np.random.randint(1000, 9999, n_rows),
        "category": np.random.choice(
            ["Electronics", "Clothing", "Food", "Home", "Sports"],
            n_rows
        ),
        "region": np.random.choice(
            ["North", "South", "East", "West"],
            n_rows
        ),
        "revenue": np.random.exponential(500, n_rows),
        "units": np.random.randint(1, 50, n_rows),
        "cost": np.random.exponential(300, n_rows),
        "customer_age": np.random.normal(40, 15, n_rows).astype(int),
        "is_promotion": np.random.choice([0, 1], n_rows, p=[0.7, 0.3])
    })


# Usage
# sample_df = generate_sample_sales_data(10000)
# sample_df.to_csv("sample_sales.csv", index=False)
# results = sales_eda_pipeline("sample_sales.csv", "sales_eda_output", "revenue")
Example 2: Machine Learning Feature Analysis
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification, make_regression
import os


def ml_feature_analysis(
    X: pd.DataFrame,
    y: pd.Series,
    task_type: str = "classification",
    output_dir: str = "ml_eda"
) -> dict:
    """
    Analyze features for machine learning using AutoViz.

    Args:
        X: Feature DataFrame
        y: Target Series
        task_type: 'classification' or 'regression'
        output_dir: Output directory

    Returns:
        Analysis results dictionary
    """
    os.makedirs(output_dir, exist_ok=True)

    # Combine features and target
    df = X.copy()
    df["target"] = y

    print(f"Feature Analysis for {task_type}")
    print(f"Features: {len(X.columns)}")
    print(f"Samples: {len(X)}")

    # Feature statistics
    feature_stats = []
    for col in X.columns:
        # is_numeric_dtype also handles categorical and string extension dtypes,
        # where np.issubdtype can raise a TypeError
        is_numeric = pd.api.types.is_numeric_dtype(X[col])
        stats = {
            "feature": col,
            "dtype": str(X[col].dtype),
            "missing": X[col].isnull().sum(),
            "unique": X[col].nunique(),
            "mean": X[col].mean() if is_numeric else None,
            "std": X[col].std() if is_numeric else None
        }
        feature_stats.append(stats)

    feature_stats_df = pd.DataFrame(feature_stats)

    # Run AutoViz
    AV = AutoViz_Class()

    df_analyzed = AV.AutoViz(
        filename="",
        dfte=df,
        depVar="target",
        chart_format="png",
        save_plot_dir=output_dir,
        verbose=2
    )

    # Calculate feature correlations with target
    numeric_cols = X.select_dtypes(include=[np.number]).columns
    correlations = {}
    for col in numeric_cols:
        corr = df[col].corr(df["target"])
        correlations[col] = corr

    corr_df = pd.DataFrame.from_dict(
        correlations,
        orient="index",
        columns=["correlation"]
    ).sort_values("correlation", key=abs, ascending=False)

    # Save correlation and feature statistics summaries
    corr_df.to_csv(os.path.join(output_dir, "feature_correlations.csv"))
    feature_stats_df.to_csv(os.path.join(output_dir, "feature_statistics.csv"))

    print("\nTop correlated features:")
    print(corr_df.head(10))

    return {
        "analyzed_df": df_analyzed,
        "feature_stats": feature_stats_df,
        "correlations": corr_df,
        "output_dir": output_dir
    }


# Generate classification dataset
def create_classification_dataset(n_samples: int = 5000) -> tuple:
    """Create sample classification dataset."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=15,
        n_informative=8,
        n_redundant=3,
        n_classes=2,
        random_state=42
    )

    feature_names = [f"feature_{i}" for i in range(X.shape[1])]
    X_df = pd.DataFrame(X, columns=feature_names)

    # Add categorical features
    X_df["category_1"] = np.random.choice(["A", "B", "C"], n_samples)
    X_df["category_2"] = np.random.choice(["Low", "Medium", "High"], n_samples)

    y_series = pd.Series(y, name="target")

    return X_df, y_series


# Generate regression dataset
def create_regression_dataset(n_samples: int = 5000) -> tuple:
    """Create sample regression dataset."""
    X, y = make_regression(
        n_samples=n_samples,
        n_features=12,
        n_informative=6,
        noise=10,
        random_state=42
    )

    feature_names = [f"feature_{i}" for i in range(X.shape[1])]
    X_df = pd.DataFrame(X, columns=feature_names)

    # Add categorical features
    X_df["region"] = np.random.choice(["North", "South", "East", "West"], n_samples)
    X_df["segment"] = np.random.choice(["Premium", "Standard", "Budget"], n_samples)

    y_series = pd.Series(y, name="target")

    return X_df, y_series


# Usage
# X, y = create_classification_dataset(5000)
# results = ml_feature_analysis(X, y, "classification", "classification_eda")

# X, y = create_regression_dataset(5000)
# results = ml_feature_analysis(X, y, "regression", "regression_eda")
Example 3: Multi-Dataset Comparison
from autoviz import AutoViz_Class
import pandas as pd
import numpy as np
import os


def compare_datasets(
    datasets: dict,
    output_dir: str = "comparison_output"
) -> dict:
    """
    Compare multiple datasets using AutoViz.

    Args:
        datasets: Dictionary of {name: DataFrame}
        output_dir: Output directory

    Returns:
        Comparison results
    """
    os.makedirs(output_dir, exist_ok=True)

    comparison_results = {}
    AV = AutoViz_Class()

    for name, df in datasets.items():
        print(f"\n{'='*50}")
        print(f"Analyzing: {name}")
        print(f"{'='*50}")

        # Create dataset-specific output directory
        dataset_dir = os.path.join(output_dir, name.replace(" ", "_"))
        os.makedirs(dataset_dir, exist_ok=True)

        # Run AutoViz
        # Per the AutoViz README, verbose=2 saves charts to disk instead of displaying them
        df_analyzed = AV.AutoViz(
            filename="",
            dfte=df,
            chart_format="png",
            save_plot_dir=dataset_dir,
            verbose=2
        )

        # Collect statistics
        numeric_cols = df.select_dtypes(include=[np.number]).columns

        stats = {
            "rows": len(df),
            "columns": len(df.columns),
            "numeric_columns": len(numeric_cols),
            "categorical_columns": len(df.columns) - len(numeric_cols),
            "missing_values": df.isnull().sum().sum(),
            "memory_mb": df.memory_usage(deep=True).sum() / 1024**2
        }

        # Numeric column statistics
        if len(numeric_cols) > 0:
            stats["numeric_summary"] = df[numeric_cols].describe().to_dict()

        comparison_results[name] = {
            "stats": stats,
            "output_dir": dataset_dir
        }

    # Create comparison summary
    summary_data = []
    for name, result in comparison_results.items():
        row = {"dataset": name}
        row.update(result["stats"])
        summary_data.append(row)

    summary_df = pd.DataFrame(summary_data)
    summary_path = os.path.join(output_dir, "comparison_summary.csv")
    summary_df.to_csv(summary_path, index=False)

    print(f"\n{'='*50}")
    print("Comparison Summary")
    print(f"{'='*50}")
    print(summary_df[["dataset", "rows", "columns", "missing_values", "memory_mb"]])

    return {
        "results": comparison_results,
        "summary": summary_df,
        "output_dir": output_dir
    }


# Create sample datasets for comparison
def create_comparison_datasets() -> dict:
    """Create multiple datasets for comparison."""
    np.random.seed(42)

    # Dataset 1: Sales Q1
    df_q1 = pd.DataFrame({
        "revenue": np.random.exponential(1000, 5000),
        "units": np.random.randint(1, 100, 5000),
        "category": np.random.choice(["A", "B", "C"], 5000),
        "region": np.random.choice(["North", "South"], 5000),
        "month": np.random.choice(["Jan", "Feb", "Mar"], 5000)
    })

    # Dataset 2: Sales Q2 (different distribution)
    df_q2 = pd.DataFrame({
        "revenue": np.random.exponential(1200, 6000),               # Higher revenue
        "units": np.random.randint(5, 120, 6000),                   # More units
        "category": np.random.choice(["A", "B", "C", "D"], 6000),   # New category
        "region": np.random.choice(["North", "South", "East"], 6000),
        "month": np.random.choice(["Apr", "May", "Jun"], 6000)
    })

    # Dataset 3: Customer data
    df_customers = pd.DataFrame({
        "age": np.random.normal(40, 15, 3000).astype(int),
        "income": np.random.exponential(50000, 3000),
        "tenure_months": np.random.randint(1, 120, 3000),
        "segment": np.random.choice(["Premium", "Standard", "Budget"], 3000),
        "is_active": np.random.choice([0, 1], 3000, p=[0.2, 0.8])
    })

    return {
        "Sales Q1": df_q1,
        "Sales Q2": df_q2,
        "Customer Data": df_customers
    }


# Usage
# datasets = create_comparison_datasets()
# comparison = compare_datasets(datasets, "multi_dataset_comparison")
Integration Examples
AutoViz with Streamlit
import streamlit as st
from autoviz import AutoViz_Class
import pandas as pd
import os
import tempfile

st.set_page_config(page_title="AutoViz EDA Tool", layout="wide")
st.title("AutoViz Exploratory Data Analysis")

# File upload
uploaded_file = st.file_uploader("Upload CSV file", type=["csv"])

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)

    st.subheader("Data Preview")
    st.dataframe(df.head(100))

    col1, col2 = st.columns(2)
    with col1:
        st.metric("Rows", f"{len(df):,}")
    with col2:
        st.metric("Columns", len(df.columns))

    # Target variable selection
    target = st.selectbox(
        "Select target variable (optional)",
        ["None"] + list(df.columns)
    )

    if st.button("Run AutoViz Analysis"):
        with st.spinner("Generating visualizations..."):
            # Create temp directory for outputs
            with tempfile.TemporaryDirectory() as tmpdir:
                AV = AutoViz_Class()

                # Per the AutoViz README, verbose=2 saves charts to disk
                # instead of displaying them
                df_analyzed = AV.AutoViz(
                    filename="",
                    dfte=df,
                    depVar="" if target == "None" else target,
                    chart_format="png",
                    save_plot_dir=tmpdir,
                    verbose=2
                )

                # Display generated charts (they may sit in a subfolder of tmpdir)
                st.subheader("Generated Visualizations")
                for root, _dirs, files in os.walk(tmpdir):
                    for file in sorted(files):
                        if file.endswith(".png"):
                            st.image(os.path.join(root, file))

        st.success("Analysis complete!")
AutoViz with Polars
from autoviz import AutoViz_Class
import polars as pl
import pandas as pd


def autoviz_polars(lf: pl.LazyFrame, target: str = "", **kwargs) -> pd.DataFrame:
    """
    Run AutoViz on Polars LazyFrame.

    Args:
        lf: Polars LazyFrame
        target: Target variable name
        **kwargs: Additional AutoViz parameters

    Returns:
        Analyzed DataFrame
    """
    # Collect LazyFrame to DataFrame, then convert to pandas
    df_polars = lf.collect()
    df_pandas = df_polars.to_pandas()

    AV = AutoViz_Class()
    return AV.AutoViz(
        filename="",
        dfte=df_pandas,
        depVar=target,
        **kwargs
    )


# Usage
# lf = pl.scan_csv("data.csv")
# df_analyzed = autoviz_polars(lf, target="revenue", chart_format="png")
Best Practices
1. Sample Large Datasets
# GOOD: Use sampling for initial exploration
AV.AutoViz(
    filename="",
    dfte=large_df,
    max_rows_analyzed=50000,    # Sample for speed
    verbose=1
)

# AVOID: Analyzing millions of rows directly
# This will be slow and may crash
2. Specify Target Variable When Available
# GOOD: Specify target for focused analysis
AV.AutoViz(
    filename="",
    dfte=df,
    depVar="target_column",     # Enables target-specific charts
    verbose=1
)

# LESS USEFUL: No target specified
# Still works but misses target-related insights
3. Choose Appropriate Chart Format
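# For presentations: PNG
chart_format="png"

# For saved interactive reports: HTML (Bokeh charts written to HTML files)
chart_format="html"

# For interactive charts inside notebooks: bokeh
chart_format="bokeh"

# For dashboards opened in the browser: server
chart_format="server"

# For scalable graphics: SVG
chart_format="svg"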
4. Organize Output
# GOOD: Save to organized directory
import os
from datetime import datetime

output_dir = f"eda_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
os.makedirs(output_dir, exist_ok=True)

AV.AutoViz(
    filename="",
    dfte=df,
    save_plot_dir=output_dir,
    chart_format="png"
)
Troubleshooting
Common Issues
Issue: Charts not displaying in Jupyter
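# Solution: Enable inline plotting and use a format that renders in the notebook
%matplotlib inline
AV.AutoViz(filename="", dfte=df, chart_format="svg", verbose=1)

# For interactive charts in the notebook, try chart_format="bokeh"
# (per the AutoViz README, "server" opens dashboards in the browser instead)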
Issue: Memory error with large dataset
# Solution: Reduce sample size
AV.AutoViz(
    filename="",
    dfte=df,
    max_rows_analyzed=25000,    # Reduce sample
    max_cols_analyzed=15        # Limit columns
)
Issue: Too many charts generated
# Solution: Limit columns analyzed
df_subset = df[["col1", "col2", "col3", "target"]]
AV.AutoViz(filename="", dfte=df_subset)
Issue: Categorical columns not recognized
# Solution: Convert to proper dtype
df["category"] = df["category"].astype("category")
AV.AutoViz(filename="", dfte=df)
Issue: Date columns causing issues
# Solution: Convert to datetime or extract features
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df_features = df.drop(columns=["date"])
AV.AutoViz(filename="", dfte=df_features)
Version History
- 1.0.0 (2026-01-17): Initial release
  - Basic one-line EDA functionality
  - Chart format options (png, svg, html, bokeh, server)
  - Large dataset handling with sampling
  - Feature distribution analysis
  - Correlation detection
  - Outlier identification
  - HTML and notebook export
  - Complete pipeline examples
  - Integration with Streamlit and Polars
  - Best practices and troubleshooting
Resources
- Official Documentation: https://github.com/AutoViML/AutoViz
- PyPI: https://pypi.org/project/autoviz/
- Tutorial: https://towardsdatascience.com/autoviz-a-new-tool-for-automated-visualization-ec9c1744a6ad
- Examples: https://github.com/AutoViML/AutoViz/tree/master/examples
Automate your exploratory data analysis with AutoViz - one line of code, comprehensive insights!