Semantic Types

What Are Semantic Types?

When you upload a dataset, DashAI assigns a semantic type to each column. Semantic types go beyond raw storage formats (e.g., PyArrow int32 or string) to express the ML-meaningful nature of the data: is this column a continuous measurement, a discrete label, free-form text, or a date?

This classification drives three critical behaviours throughout the platform:

  • Task compatibility — only columns whose types match a task's requirements can be selected as inputs or outputs.
  • Converter chaining — converters declare the type they accept and the type they produce, enabling safe preprocessing pipelines.
  • Label encoding — categorical output columns are automatically integer-encoded before training and decoded back to string labels after prediction.

Type Hierarchy

All semantic types inherit from a common abstract base class, DashAIDataType.

DashAIDataType
├── DashAIValue # abstract parent for all value types
│ ├── Integer # int8, int16, int32, int64 (signed or unsigned)
│ ├── Float # float16, float32, float64
│ ├── Text # string with encoding (default: UTF-8)
│ ├── Date # calendar date (default format: YYYY-MM-DD)
│ ├── Time # time of day (default format: HH:mm:ss)
│ ├── Timestamp # datetime with timezone (default: YYYY-MM-DD HH:mm:ss)
│ ├── Duration # elapsed time with unit (s, ms, us, ns)
│ ├── Decimal # precise decimal (128 or 256-bit, with precision and scale)
│ └── Binary # raw binary data
└── Categorical # discrete labels with str ↔ int encoding map

DashAIValue types represent scalar values: numeric measurements, text, temporal values, and raw bytes. Categorical is a separate branch because it carries additional structure: the full list of unique categories and a bijective encoding mapping.
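The two branches of the hierarchy can be pictured as a small set of Python classes. The sketch below is illustrative only: the class names follow the tree above, but the fields and their defaults are assumptions, not DashAI's actual definitions.

```python
from abc import ABC
from dataclasses import dataclass, field


class DashAIDataType(ABC):
    """Common base for every semantic type (sketch)."""


class DashAIValue(DashAIDataType):
    """Parent of the value types (sketch)."""


@dataclass
class Integer(DashAIValue):
    dtype: str = "int64"  # assumed default
    signed: bool = True   # assumed default


@dataclass
class Float(DashAIValue):
    dtype: str = "float64"  # assumed default


@dataclass
class Categorical(DashAIDataType):
    # Sits beside DashAIValue in the tree, not under it.
    categories: list = field(default_factory=list)


col_type = Categorical(categories=["cat", "dog", "bird"])
is_semantic_type = isinstance(col_type, DashAIDataType)  # True
is_value_type = isinstance(col_type, DashAIValue)        # False
```

Code that only needs to know "is this a semantic type?" can test against DashAIDataType, while code that must treat categories specially can test against Categorical directly.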


Concrete Types

Value Types

| Type | Key attributes | Typical use |
|---|---|---|
| Integer | dtype (e.g. int64), signed | Count features, ordinal-encoded labels |
| Float | dtype (e.g. float64) | Continuous measurements, regression targets |
| Text | dtype (string), encoding | Free-form text, NLP tasks |
| Date | format (default YYYY-MM-DD) | Calendar dates |
| Time | format (default HH:mm:ss) | Time-of-day values |
| Timestamp | format, timezone | Datetime with timezone |
| Duration | unit (s, ms, us, ns) | Time intervals |
| Decimal | precision, scale, bit_width | High-precision numerics |
| Binary | | Raw byte payloads |

Categorical

Categorical is the most structurally rich type. It stores:

  • categories — ordered list of unique string values (["cat", "dog", "bird"])
  • dtype — underlying PyArrow storage type (string, int64, etc.)
  • encoding / str2int — dictionary mapping each category to an integer ({"cat": 0, "dog": 1, "bird": 2})
  • decoding / int2str — reverse mapping ({0: "cat", 1: "dog", 2: "bird"})
  • converted — flag indicating whether the column has already been integer-encoded

Categorical is used for all classification target columns and for any feature column that contains a discrete set of labels (e.g., country, product category).
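As a rough illustration of how the two maps relate, the following sketch derives str2int and int2str from an ordered category list. The helper name build_encoding is hypothetical; DashAI's own Categorical class may construct its maps differently.

```python
def build_encoding(categories):
    """Return (str2int, int2str) for an ordered list of unique categories."""
    if len(set(categories)) != len(categories):
        raise ValueError("categories must be unique for the mapping to be bijective")
    # Position in the list becomes the integer code.
    str2int = {label: index for index, label in enumerate(categories)}
    # Invert the map; uniqueness guarantees no collisions.
    int2str = {index: label for label, index in str2int.items()}
    return str2int, int2str


str2int, int2str = build_encoding(["cat", "dog", "bird"])
print(str2int)  # {'cat': 0, 'dog': 1, 'bird': 2}
print(int2str)  # {0: 'cat', 1: 'dog', 2: 'bird'}
```

Because the codes come from list positions, the order of categories determines the encoding, which is why the list is stored as ordered.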


Type Inference

Types are assigned automatically when a dataset is loaded. DashAI supports two inference methods, selectable at upload time.

Primary: DashAIPtype

DashAIPtype uses the ptype probabilistic type-inference model, which analyses each column's values to estimate the most likely semantic type. The supported ptype outputs and their DashAI mappings are:

| ptype output | DashAI type |
|---|---|
| integer | Integer (int64) |
| float | Float (float64) |
| string | Text (string, UTF-8) |
| boolean | Categorical (string) |
| categorical | Categorical (string) |
| date-iso-8601 | Text (date parsing not yet automatic) |
| date-eu | Text |
| float_comma | Float (comma-decimal normalised) |

After ptype classification, any column whose unique-value count and ratio fall within configurable thresholds is further promoted to Categorical regardless of the original ptype output.
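A promotion rule of this kind might look like the sketch below. The function name, the parameter names, and the default thresholds (20 distinct values, 5% ratio) are all assumptions made for illustration; the source does not state DashAI's actual defaults.

```python
def should_promote_to_categorical(values, max_unique=20, max_unique_ratio=0.05):
    """Hypothetical rule: few distinct values, both in absolute and relative terms."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    n_unique = len(set(non_null))
    ratio = n_unique / len(non_null)
    return n_unique <= max_unique and ratio <= max_unique_ratio


ratings = [1, 2, 3, 4, 5] * 200  # 5 distinct values across 1000 rows
user_ids = list(range(1000))     # every value distinct

promote_ratings = should_promote_to_categorical(ratings)    # True
promote_user_ids = should_promote_to_categorical(user_ids)  # False
```

Checking the ratio as well as the count keeps a short column, where every row happens to be distinct, from being promoted just because it has few rows.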

Fallback: DummyCategoricalInference

A lightweight heuristic used when ptype is unavailable:

  • String columns with fewer than 10 unique values → Categorical
  • Integer columns with fewer than 10 unique values → Categorical
  • All other integer columns → Integer
  • All other string columns → Text
  • Float columns → Float
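The rules above can be sketched in a few lines. This version works on plain Python lists rather than PyArrow columns, which is a simplification, and the function name fallback_infer is hypothetical.

```python
def fallback_infer(values, max_unique=10):
    """Apply the fallback rules to one column given as a list of Python values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        raise ValueError("cannot infer a type from an empty column")
    n_unique = len(set(non_null))
    if all(isinstance(v, str) for v in non_null):
        return "Categorical" if n_unique < max_unique else "Text"
    if all(isinstance(v, int) for v in non_null):
        return "Categorical" if n_unique < max_unique else "Integer"
    if all(isinstance(v, float) for v in non_null):
        return "Float"
    raise ValueError("column kind not covered by this sketch")


colour_type = fallback_infer(["red", "green", "red"])                 # Categorical
index_type = fallback_infer(list(range(50)))                          # Integer
price_type = fallback_infer([0.5, 1.25, 3.0])                         # Float
review_type = fallback_infer([f"review {i}" for i in range(12)])      # Text
```

Note that the cut-off is strict: a string column with exactly 10 distinct values stays Text.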

Special cases

  • PyArrow bool columns are always mapped to Categorical (two-category: True/False).
  • PyArrow dictionary-encoded columns are mapped to Categorical with an initially empty category list.

Type Persistence

Semantic types are serialised to Apache Arrow table metadata under the key dashai_types and stored alongside the dataset's Arrow IPC file. This means:

  • Types survive save/load round-trips without re-inference.
  • Notebooks inherit the types of their source dataset.
  • Converters that change a column's type update the metadata in place.

The relevant utilities are save_types_in_arrow_metadata() and get_types_from_arrow_metadata() in DashAI/back/types/utils.py.
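Arrow schema metadata is a flat map of byte strings, so structured type information has to be serialised before it is stored. The sketch below shows one way that could work using JSON. The JSON layout and the helper names encode_types and decode_types are assumptions; the signatures and format used by save_types_in_arrow_metadata() and get_types_from_arrow_metadata() are not shown in this document.

```python
import json

# Key named in the text; bytes because Arrow metadata keys and values are byte strings.
METADATA_KEY = b"dashai_types"


def encode_types(column_types):
    """Serialise a {column: type description} dict into a metadata mapping."""
    payload = json.dumps(column_types, sort_keys=True).encode("utf-8")
    return {METADATA_KEY: payload}


def decode_types(metadata):
    """Recover the type descriptions from a metadata mapping."""
    return json.loads(metadata[METADATA_KEY].decode("utf-8"))


types = {
    "age": {"type": "Integer", "dtype": "int64"},
    "species": {"type": "Categorical", "categories": ["cat", "dog", "bird"]},
}
metadata = encode_types(types)
restored = decode_types(metadata)
```

With PyArrow, a mapping like this could be attached to a table through Table.replace_schema_metadata() and read back from table.schema.metadata, which is what lets the types travel with the Arrow IPC file.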


How Types Are Used

Task Compatibility

Every task class declares the semantic types it accepts for input and output columns via a metadata dictionary:

metadata = {
    "inputs_types": [Float, Integer, Categorical],  # allowed input column types
    "outputs_types": [Categorical],                 # required output column type
    "inputs_cardinality": "n",                      # any number of inputs
    "outputs_cardinality": 1,                       # exactly one output
}

Before training, validate_dataset_for_task() checks that every selected column's semantic type is in the allowed set. Columns that do not match are rejected with a descriptive error.

Type requirements by task:

| Task | Allowed input types | Required output type |
|---|---|---|
| TabularClassificationTask | Float, Integer, Categorical | Categorical |
| RegressionTask | Float, Integer, Categorical | Float or Integer |
| TextClassificationTask | Text (exactly 1 column) | Categorical (exactly 1 column) |
| TranslationTask | Text (exactly 1 column) | Text (exactly 1 column) |
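To make the check concrete, here is a minimal sketch of how a task's metadata dictionary could be applied to a column selection. It is not the implementation of validate_dataset_for_task(): the function name check_columns_for_task is hypothetical, and type names are plain strings rather than type classes to keep the example self-contained.

```python
def check_columns_for_task(metadata, column_types, inputs, outputs):
    """Return a list of error messages; an empty list means the selection is valid."""
    errors = []
    for role, selected in (("inputs", inputs), ("outputs", outputs)):
        allowed = metadata[f"{role}_types"]
        cardinality = metadata[f"{role}_cardinality"]
        # "n" means any number of columns; otherwise the count must match exactly.
        if cardinality != "n" and len(selected) != cardinality:
            errors.append(f"expected {cardinality} {role} column(s), got {len(selected)}")
        for name in selected:
            if column_types[name] not in allowed:
                errors.append(
                    f"{role} column '{name}' has type {column_types[name]}, allowed: {allowed}"
                )
    return errors


task_metadata = {
    "inputs_types": ["Float", "Integer", "Categorical"],
    "outputs_types": ["Categorical"],
    "inputs_cardinality": "n",
    "outputs_cardinality": 1,
}
column_types = {"age": "Integer", "income": "Float", "notes": "Text", "churned": "Categorical"}

ok = check_columns_for_task(task_metadata, column_types, ["age", "income"], ["churned"])
bad = check_columns_for_task(task_metadata, column_types, ["age", "notes"], ["income"])
```

Here ok is empty, while bad reports two problems: the Text column notes is not an allowed input, and the Float column income is not an allowed output.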

Converter Type Contracts

Each converter implements get_output_type(column_name) to declare the semantic type of each output column. This allows DashAI to track the type of every column through a multi-step preprocessing pipeline.

Common converter contracts:

| Converter | Input type | Output type |
|---|---|---|
| OneHotEncoder | Categorical | Integer (one binary column per category) |
| OrdinalEncoder | Categorical | Integer |
| LabelEncoder | Categorical | Integer |
| LabelBinarizer | Categorical | Integer |
| StandardScaler | Integer, Float | Float |
| MinMaxScaler | Integer, Float | Float |
| Normalizer | Integer, Float | Float |
| Binarizer | Integer, Float | Integer |
| TFIDFConverter | Text | Float |
| BagOfWordsConverter | Text | Float |
| TokenizerConverter | Text | Integer |
| PCA, TruncatedSVD, FastICA | Integer, Float | Float |
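The following sketch shows how declared contracts allow a column's type to be followed through a pipeline. ConverterSketch and trace_column_type are hypothetical stand-ins written for this example; only the get_output_type(column_name) method name comes from the text, and type names are plain strings for brevity.

```python
class ConverterSketch:
    """Hypothetical stand-in for a converter with a declared type contract."""

    def __init__(self, name, accepted_types, output_type):
        self.name = name
        self.accepted_types = set(accepted_types)
        self.output_type = output_type

    def get_output_type(self, column_name):
        # This sketch returns the same type for every column.
        return self.output_type


def trace_column_type(column_name, start_type, converters):
    """Follow one column's type through the pipeline, failing on a mismatch."""
    current = start_type
    for converter in converters:
        if current not in converter.accepted_types:
            raise TypeError(
                f"{converter.name} cannot accept column '{column_name}' of type {current}"
            )
        current = converter.get_output_type(column_name)
    return current


ordinal = ConverterSketch("OrdinalEncoder", ["Categorical"], "Integer")
scaler = ConverterSketch("StandardScaler", ["Integer", "Float"], "Float")

final_type = trace_column_type("species", "Categorical", [ordinal, scaler])
print(final_type)  # Float
```

Passing a Text column straight to the scaler would raise TypeError before any data is touched, which is the kind of early failure that type contracts make possible.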

Label Encoding

Classification tasks require a Categorical output column, but most ML models require numeric targets. DashAI handles this automatically:

  1. Before training: categorical_label_encoder() converts each Categorical output column to Integer using the Categorical type's str2int map. The mapping is saved so that it can be reversed.
  2. After prediction: process_predictions() applies the reverse int2str map to convert integer predictions back to their original string labels before displaying results or saving to disk.

No manual encoding step is needed from the user.
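The round trip can be illustrated with two small helpers. These are not the bodies of categorical_label_encoder() or process_predictions(); the names encode_labels and decode_predictions, and the made-up prediction values, are assumptions for the example.

```python
def encode_labels(labels, str2int):
    """Map string labels to integer codes, rejecting labels outside the category list."""
    unknown = sorted(set(labels) - set(str2int))
    if unknown:
        raise ValueError(f"labels not in the category list: {unknown}")
    return [str2int[label] for label in labels]


def decode_predictions(predictions, int2str):
    """Map integer predictions back to their string labels."""
    return [int2str[int(p)] for p in predictions]


str2int = {"cat": 0, "dog": 1, "bird": 2}
int2str = {index: label for label, index in str2int.items()}

encoded = encode_labels(["dog", "cat", "bird", "dog"], str2int)
print(encoded)  # [1, 0, 2, 1]

predictions = [1, 1, 2, 0]  # made-up model outputs
decoded = decode_predictions(predictions, int2str)
print(decoded)  # ['dog', 'dog', 'bird', 'cat']
```

Because the same map is used in both directions, decoding the encoded labels always returns the original labels.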

Type Validation

When a user manually changes a column's semantic type in the UI, validate_type_change() checks whether the conversion is safe and feasible:

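| From \ To | Integer | Float | Text | Categorical | Date | Time | Timestamp |
|---|---|---|---|---|---|---|---|
| Integer | | | | ✓ (if low cardinality) | | | |
| Float | ✓ (if whole numbers) | | | ✓ (if low cardinality) | | | |
| Text | ✓ (if parseable) | ✓ (if parseable) | | ✓ (if low cardinality) | | | |
| Categorical | | | | | | | |
| Date | | | | | | | |
| Time | | | | | | | |
| Timestamp | | | | | | | |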

If the conversion is not safe (e.g., promoting a high-cardinality text column to Categorical), the validator returns a descriptive error before any data is modified.
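The conditional checks named in the matrix (whole numbers, parseable, low cardinality) could be implemented roughly as follows. This is a sketch, not the code of validate_type_change(): the function name check_type_change, the string type names, and the max_categories threshold of 20 are assumptions, and only the conditional conversions are covered.

```python
def _parses_as(value, caster):
    """Return True if caster(value) succeeds."""
    try:
        caster(value)
        return True
    except (TypeError, ValueError):
        return False


def check_type_change(values, source, target, max_categories=20):
    """Return (ok, reason) for the conditional conversions; reason is '' when ok."""
    if source in ("Integer", "Float", "Text") and target == "Categorical":
        ok = len(set(values)) <= max_categories
        reason = "too many distinct values"
    elif source == "Float" and target == "Integer":
        ok = all(float(v).is_integer() for v in values)
        reason = "column contains non-whole numbers"
    elif source == "Text" and target in ("Integer", "Float"):
        caster = int if target == "Integer" else float
        ok = all(_parses_as(v, caster) for v in values)
        reason = f"some values do not parse as {target}"
    else:
        raise NotImplementedError(f"{source} -> {target} is not covered by this sketch")
    return ok, ("" if ok else reason)


whole = check_type_change([1.0, 2.0, 3.0], "Float", "Integer")
fractional = check_type_change([1.5, 2.0], "Float", "Integer")
unparseable = check_type_change(["12", "x7"], "Text", "Integer")
high_cardinality = check_type_change([f"id-{i}" for i in range(100)], "Text", "Categorical")
```

Each check only reads the values and returns a verdict, so a failed validation leaves the column unchanged.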


Source Files

| File | Role |
|---|---|
| DashAI/back/types/dashai_data_type.py | Abstract base class DashAIDataType |
| DashAI/back/types/dashai_value.py | Abstract intermediate class DashAIValue |
| DashAI/back/types/value_types.py | Concrete value type classes |
| DashAI/back/types/categorical.py | Categorical type with encoding logic |
| DashAI/back/types/utils.py | Arrow ↔ DashAI type conversion, metadata I/O |
| DashAI/back/types/type_validation.py | validate_type_change() and suitability checks |
| DashAI/back/types/inf/inference_methods.py | DashAIPtype and DummyCategoricalInference |
| DashAI/back/types/inf/type_inference.py | infer_types() entry point |