Module Guide: Datasets

The Datasets module is the entry point for all data in DashAI. Every other module — Models, Notebooks, Generative — depends on a dataset being loaded here first. This guide covers what the module does, how its components work, and how to get the most out of each feature.


The Dataset Module Interface

The left sidebar lists all available datasets and notebooks. Each dataset entry shows its name, row count, and column count at a glance. Clicking a dataset opens its full view in the main area.

The New Dataset/Notebook button at the top of the sidebar is the entry point for both uploading a new dataset and creating a new notebook linked to an existing one.


Uploading Data

DashAI supports three file formats across four file extensions. Each format has a dedicated dataloader that controls how the file is parsed.

Supported Formats and Dataloaders

Format | Dataloader      | Extensions
CSV    | CSVDataLoader   | .csv
Excel  | ExcelDataLoader | .xlsx, .xls
JSON   | JSONDataLoader  | .json

The upload flow is inline — everything happens within the Datasets page without navigating away.

Type Inference

After uploading a file, DashAI reads a configurable number of rows (Inference Rows, default 1000) and automatically assigns a semantic type to each column: Categorical, Float, or Integer. These types are used throughout the platform — by the Explorer tabs, the Models module for column compatibility checks, and Notebook converters for filtering applicable operations.

You can override any inferred type directly in the upload preview by clicking the dropdown in each column header. Correcting types at upload time prevents downstream issues in experiments and transformations.
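The inference step can be pictured as a simple cascade over a sample of each column's values: try Integer first, then Float, and fall back to Categorical. The sketch below is illustrative only (the function name and logic are assumptions, not DashAI's internal code):

```python
# Hypothetical sketch of sample-based type inference, in the spirit of the
# Inference Rows setting. Not DashAI's actual implementation.

def infer_column_type(values, max_rows=1000):
    """Assign Integer, Float, or Categorical from a sample of raw values."""
    sample = [v for v in values[:max_rows] if v not in ("", None)]
    # Integer: every sampled value is an int or a string of digits
    if all(isinstance(v, int) or (isinstance(v, str) and v.lstrip("-").isdigit())
           for v in sample):
        return "Integer"
    # Float: every sampled value parses as a number
    try:
        for v in sample:
            float(v)
        return "Float"
    except (TypeError, ValueError):
        return "Categorical"
```

Reading a bounded sample rather than the whole file keeps inference fast on large uploads, at the cost of occasionally mistyping a column whose first rows are unrepresentative; that is exactly why the manual override exists.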

CSVDataLoader Parameters

Parameter | Default  | Description
Name      | filename | Display name for the dataset inside DashAI
Separator | ,        | Character separating column values. Use ; for European-locale Excel exports
Header    | infer    | Row containing column names. infer detects automatically; set a number for files with metadata rows before the header
Names     | Null     | Override column names manually
Encoding  | utf-8    | File character encoding. Use latin-1 or ISO-8859-1 for files with accented characters
NA values | Null     | Additional strings to treat as missing (e.g. "?", "N/A")
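The Separator and NA values parameters map onto standard CSV parsing behavior. A rough stdlib sketch of what a CSV dataloader does with them (the function name is hypothetical):

```python
# Illustrative CSV parsing with a custom separator and NA-string mapping,
# assuming Header: infer resolves to "first row is the header".
import csv
import io

def load_csv(text, separator=",", na_values=("?", "N/A")):
    """Parse CSV text; map the listed NA strings to None."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader)  # first row holds the column names
    rows = [[None if cell in na_values else cell for cell in row]
            for row in reader]
    return header, rows
```

For example, a European-locale export with `;` separators and `?` markers would be read with `load_csv(text, separator=";")`, turning each `?` cell into a missing value.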

ExcelDataLoader Parameters

Parameter       | Default | Description
Sheet           | 0       | Zero-based index of the sheet to load
Header          | 0       | Zero-based row index of the column header
Use columns     | Null    | Comma-separated list of columns to load; Null loads all
Skip rows       | Null    | Rows to skip at the beginning of the sheet
N rows          | Null    | Maximum rows to load; Null loads all
Names           | Null    | Override column names
NA values       | Null    | Additional NA strings
Keep default NA |         | Recognize built-in NA strings automatically
True values     | Null    | Strings to interpret as boolean True
False values    | Null    | Strings to interpret as boolean False

JSONDataLoader Parameters

Parameter | Default  | Description
Name      | filename | Display name
Data key  | data     | Key inside the JSON object that contains the records array

The JSONDataLoader expects a structure like { "data": [{...}, {...}] }. Change Data key to match the actual key in your file if it differs from data.
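The Data key behavior amounts to indexing into the parsed object before reading records. A minimal sketch (function name hypothetical, not the DashAI API):

```python
# Pull the records array out of a JSON object under a configurable key,
# mirroring the JSONDataLoader's Data key parameter.
import json

def load_json_records(text, data_key="data"):
    obj = json.loads(text)
    # Raises KeyError if the key does not match the file's structure
    return obj[data_key]

rows = load_json_records('{"data": [{"x": 1}, {"x": 2}]}')
```

If your file uses, say, `{"records": [...]}` instead, setting Data key to `records` is the equivalent of calling `load_json_records(text, data_key="records")` here.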


Dataset Explorer (EDA)

Clicking a dataset opens its built-in EDA panel — a set of automatic analyses that run immediately with no configuration. The panel is organized into six tabs.

Quality Score

A percentage shown at the top right of every dataset view. It reflects the absence of structural data quality issues. A score of 100% means no constant columns, high cardinality issues, or potential ID columns were detected. Any score below 100% means the Data Quality tab has findings worth reviewing before training.

Overview Tab

Shows a Dataset Preview table with the actual data rows. Four toolbar controls are available:

  • COLUMNS — show/hide specific columns to focus on what matters
  • FILTERS — apply row-level filters to inspect subsets
  • DENSITY — toggle row height between compact and comfortable
  • EXPORT — download the current view

The five summary cards at the top of every dataset view give an immediate health check: Total Rows, Total Columns, File Size (MB), Duplicated Rows, and Missing Values. Non-zero values in Duplicated Rows or Missing Values indicate data quality work may be needed before training.
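Two of these cards, Duplicated Rows and Missing Values, are simple aggregates over the table. An illustrative stdlib computation (not DashAI's internal code) over a list-of-dicts table:

```python
# Count duplicated rows and missing cells, as the summary cards report them.
# Hypothetical sketch; row representation and counting rules are assumptions.

def summary_cards(rows):
    seen, duplicated, missing = set(), 0, 0
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable row fingerprint
        if key in seen:
            duplicated += 1               # repeat of an earlier row
        seen.add(key)
        missing += sum(1 for v in row.values() if v is None)
    return {"total_rows": len(rows), "duplicated_rows": duplicated,
            "missing_values": missing}

cards = summary_cards([{"a": 1, "b": None},
                       {"a": 1, "b": None},
                       {"a": 2, "b": 3}])
```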

Numerical Analysis Tab

For every Float or Integer column, DashAI computes and displays:

Descriptive statistics: Mean, Median, Standard Deviation, Unique count

Distribution metrics: Min, Q1, Median, Q3, Max

Shape indicators: Skewness, Kurtosis, Outlier count, Range

Boxplot: Visual five-number summary. Outliers appear as points beyond the whiskers.

Intelligent alerts: DashAI detects common distribution patterns and suggests actions automatically. For example:

⚠️ Right-skewed distribution: Consider applying a log transformation.

These suggestions are actionable — if you see one, the corresponding Notebook converter (e.g., a log transform) is the recommended next step.
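The log-transformation suggestion itself is a one-liner. Using log(1 + x) rather than a plain log keeps zero values valid, since log1p(0) = 0:

```python
# Compress a right-skewed, non-negative column with log(1 + x).
# Sketch of the transformation the alert suggests, not a DashAI converter call.
import math

def log_transform(values):
    """Apply log1p to each non-negative value."""
    return [math.log1p(v) for v in values]

transformed = log_transform([0, 1, 10, 1000])
```

Note how the long right tail collapses: 1000 maps to roughly 6.9, while small values barely move.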

Categorical Tab

For every Categorical column:

  • Unique Values — how many distinct categories exist
  • Most Frequent — the dominant category value
  • Top Value Count — how many times the dominant value appears
  • Value Distribution — bar chart of all category counts
  • Proportion — pie chart showing each category's share

A heavily imbalanced distribution (where one category dominates) in your target column is a signal to consider resampling converters (SMOTE, RandomUnderSampler) before training classification models.

Text Tab

Active only when text-typed columns exist. Shows length-based statistics per column: Average Length, Median Length, Average Word Count, Unique Values, Min/Max Length, Range.

A low uniqueness warning appears when a text column has very few distinct values — this typically means the column was misclassified as text and should be Categorical. Fixing this at the dataset level (via re-upload) avoids issues downstream.

Data Quality Tab

Reports three structural issue categories:

Issue               | What it means                                                    | What to do
Constant Columns    | Every row has the same value — no predictive information         | Remove before training
High Cardinality    | A categorical column has an unusually large number of distinct values | Investigate — may be a free-text field or ID column in disguise
Possible ID Columns | Column appears to be a unique row identifier                     | Exclude from model input columns

The Missing Data Patterns panel shows whether missing values are randomly distributed or concentrated in specific columns. Concentrated missing values may indicate a systematic data collection issue worth addressing before modeling.
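All three structural issues reduce to counting distinct values per column. A heuristic sketch, with thresholds that are illustrative guesses rather than DashAI's actual cutoffs:

```python
# Heuristic detection of constant, possible-ID, and high-cardinality columns.
# The 50% cardinality threshold is an assumption for illustration only.

def quality_issues(columns):
    """columns: dict mapping column name -> list of values."""
    issues = {"constant": [], "high_cardinality": [], "possible_id": []}
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique <= 1:
            issues["constant"].append(name)        # one value everywhere
        elif n_unique == len(values):
            issues["possible_id"].append(name)     # every value distinct
        elif n_unique > 0.5 * len(values):
            issues["high_cardinality"].append(name)
    return issues

found = quality_issues({"id": [1, 2, 3, 4],
                        "const": ["x"] * 4,
                        "label": ["a", "a", "b", "b"]})
```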

Correlations Tab

Computes pairwise Pearson correlations between all numerical columns. The interactive bar chart shows each column pair with color-coded bars (green = positive, red/pink = negative). Hovering shows the exact correlation value.

Strong Correlations (|r| > 0.5) are listed separately — these are the relationships most likely to be meaningful. A high correlation between two input features suggests potential redundancy; a high correlation between a feature and the target column suggests predictive value.
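Pearson's r for one column pair is the covariance divided by the product of the standard deviations; the Strong Correlations list simply filters pairs by |r| > 0.5. A pure-stdlib sketch:

```python
# Pearson correlation between two numeric columns, plus the |r| > 0.5
# strong-correlation cutoff described above.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear pair
strong = abs(r) > 0.5
```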


Notebooks

Notebooks are non-destructive workspaces attached to a dataset. They allow you to apply sequences of Explorers (visualizations) and Converters (transformations) to a working copy of the data, preview the effect of each operation live, and save the result as a new dataset.

The original dataset is never modified. All changes are isolated to the notebook's working copy until you explicitly save.

Explorer Tools (EXPLORE tab)

Explorers generate visualizations and statistical summaries from the current state of the data. They do not modify the data. Available explorers are organized into five categories:

Category                 | What it contains
Preview Inspection       | Describe Dataset (statistical summary table), Show Rows (paginated record view)
Relationship Analysis    | Density Heatmap, Multiple Scatter Plot, Scatter Plot
Statistical Analysis     | Correlation Matrix, Covariance Matrix
Distribution Analysis    | Box Plot, Empirical Cumulative Distribution, Histogram Plot, Word Cloud
Multidimensional Analysis | Multiple Column Chart, Parallel Categories, Parallel Coordinates

Each explorer has a two-step configuration: first select which columns to include (scope), then set the explorer's parameters. Results render inline in the notebook timeline below the data preview.

Converter Tools (CONVERT tab)

Converters modify the data. Each is applied to a configurable set of columns and rows, and the dataset preview updates immediately after each converter runs. Available converters are organized into eight categories:

Basic Preprocessing

Converter          | What it does
NaN Remover        | Removes rows that contain at least one missing value
Simple Imputer     | Fills missing values with mean, median, most frequent, or a constant
KNN Imputer        | Fills missing values using k-nearest neighbors
Missing Indicator  | Adds binary columns marking which values were missing
Column Remover     | Removes selected columns entirely from the dataset
Character Replacer | Replaces specific characters or strings in text columns

Encoding

Converter       | What it does
Binarizer       | Maps numeric values to 0 or 1 based on a threshold
Label Binarizer | Binarizes labels in a one-vs-all scheme
Label Encoder   | Encodes categorical labels as integers (for target columns)
One-Hot Encoder | Creates a binary column for each category value
Ordinal Encoder | Encodes categories as ordered integers

Scaling and Normalization

Converter      | What it does
Max Abs Scaler | Scales each feature by its maximum absolute value (range: -1 to 1)
Min-Max Scaler | Scales features to a specified range (default: 0 to 1)
Normalizer     | Scales each row (record) to unit norm
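The Min-Max Scaler applies x' = (x - min) / (max - min), stretched onto the target range. Written out by hand (stdlib sketch, not the converter's actual code):

```python
# Min-max scaling: map a column onto [a, b], default [0, 1].

def min_max_scale(values, feature_range=(0.0, 1.0)):
    lo, hi = min(values), max(values)
    a, b = feature_range
    # x' = a + (x - lo) * (b - a) / (hi - lo)
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30])
```

Note that min-max scaling is sensitive to outliers: a single extreme value stretches the range and squashes everything else, which is one reason to review the Numerical Analysis outlier counts first.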

Dimensionality Reduction

Converter                    | What it does
Principal Component Analysis | Reduces to n components explaining maximum variance
Incremental PCA              | PCA for large datasets processed in memory-efficient batches
Truncated SVD                | SVD-based reduction, works with sparse matrices
Fast ICA                     | Independent Component Analysis
Nystroem Approximation       | Approximates a kernel feature map for non-linear representation
Variance Threshold           | Removes features with variance below a threshold

Feature Selection

Converter                 | What it does
Select K Best             | Keeps the K features with the highest statistical scores
Select Percentile         | Keeps the top X% of features by score
Select FDR                | Selects features controlling the false discovery rate
Select FPR                | Selects features by p-value significance threshold
Select FWE                | Selects features with strict family-wise error correction
Generic Univariate Filter | Configurable univariate selector combining scoring and selection mode

Polynomial & Kernel Methods

Converter            | What it does
Polynomial Features  | Generates polynomial and interaction terms from input features
RBF Sampler          | Approximates an RBF kernel feature map using random Fourier features
Additive Chi² Sampler | Approximates the additive chi-squared kernel for non-negative data
Skewed Chi² Sampler  | Variant of chi-squared kernel approximation with a shift parameter

Resampling & Class Balancing

Converter            | What it does
SMOTE                | Generates synthetic minority class records by interpolation
SMOTE-ENN            | SMOTE followed by Edited Nearest Neighbors cleaning
Random Under-Sampler | Randomly removes majority class records to balance the dataset
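The interpolation at the heart of SMOTE is easy to state: a synthetic minority record is a random point on the line segment between a minority sample and one of its minority-class neighbors. A simplified sketch (real SMOTE first finds the k nearest minority neighbors and picks one at random):

```python
# One SMOTE interpolation step: synthetic = x + lam * (neighbor - x),
# with lam drawn uniformly from [0, 1). Simplified illustration only.
import random

def smote_sample(x, neighbor, rng=random):
    lam = rng.random()  # same interpolation factor for every feature
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

synthetic = smote_sample([0.0, 0.0], [1.0, 2.0])
```

Because every synthetic point lies between two real minority samples, SMOTE densifies the minority region rather than inventing values outside it.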

Advanced Preprocessing

Converter    | What it does
TF-IDF       | Converts text to TF-IDF feature vectors (weighted word frequencies)
Bag of Words | Converts text to raw word count vectors
Tokenizer    | Converts text into sequences of integer token indices
Embedding    | Maps token sequences to dense semantic vector representations
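The difference between Bag of Words and TF-IDF is the weighting: TF-IDF down-weights terms that appear in many documents. A toy stdlib sketch of the idea (real converters use library vectorizers with smoothing and normalization options this sketch omits):

```python
# Toy TF-IDF: term count times log(N / document frequency).
# Illustrative only; omits the smoothing real vectorizers apply.
import math

def tfidf(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    return [[doc.count(w) * math.log(n / df[w]) for w in vocab]
            for doc in tokenized]

vectors = tfidf(["cat sat", "cat ran"])
```

Here "cat" appears in every document, so its weight drops to zero, while the distinguishing words "sat" and "ran" keep positive weights.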

Saving a Transformed Dataset

When the notebook contains the transformations you want, click SAVE AS NEW DATASET. This creates a new independent dataset in DashAI with the data in its current state. The new dataset is immediately available for experiments without affecting the source dataset.


Tips

  • Use the Quality Score as a first-pass health check before doing any analysis. A score below 100% always has a specific cause visible in the Data Quality tab.
  • The Intelligent Alerts in Numerical Analysis are prioritized suggestions — address them with the corresponding Notebook converter before training to improve model performance.
  • Build Notebook transformation pipelines incrementally: add one converter at a time and verify the preview before adding the next.
  • Resampling converters (SMOTE, RandomUnderSampler) should be applied only to the training split, not the full dataset — keep this in mind when saving a transformed dataset for use in experiments.
  • For text data, apply TF-IDF or Bag of Words when working with traditional ML models (Logistic Regression, SVM, Random Forest). Neural models that accept raw text (like DistilBERT) do not require these pre-processing steps.