Semantic3D

The Semantic3D dataset is a collection of point clouds containing over four billion points in total. The dataset comes in two versions, semantic-8 and reduced-8 (a downsampled variant), of which we use only the first.

Description

The original dataset has the following structure.

data/semantic3d/original/
├── testing
│   ├── neugasse_1.laz
│   └── untermaederbrunnen_3.laz
├── training
│   ├── bildstein_1.laz
│   ├── bildstein_3.laz
│   ├── domfountain_1.laz
│   ├── domfountain_2.laz
│   ├── sg27_5.laz
│   └── untermaederbrunnen_1.laz
└── validation
    ├── bildstein_5.laz
    ├── domfountain_3.laz
    └── sg27_9.laz

Each file is a large point cloud representing a church, a street, etc. Each point cloud carries the usual attributes x, y, z, intensity, red, green and blue. A classification label is only present in the files meant for training a segmentation model, i.e. those inside the data/semantic3d/original/training directory.
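
As a quick sanity check, the per-point attributes can be inspected with laspy. This is a minimal sketch rather than part of the project's API; it assumes laspy is installed together with a LAZ backend (e.g. pip install laspy[lazrs]).

import laspy
import numpy as np

# read one of the labeled training clouds; decoding .laz files
# requires a LAZ backend such as lazrs or laszip
cloud = laspy.read('data/semantic3d/original/training/bildstein_1.laz')

# spatial coordinates and the remaining per-point attributes
print(cloud.x[:3], cloud.y[:3], cloud.z[:3])
print(cloud.intensity[:3])
print(cloud.red[:3], cloud.green[:3], cloud.blue[:3])

# classification labels are only stored in the training files
print(np.unique(cloud.classification))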

Note

To better process the dataset in batches when training and validating a deep learning model, we also provide a pre-split version of it under the data/semantic3d/split/ directory of the project. Each big point cloud file, for instance data/semantic3d/original/training/sg27_5.laz, is split into many smaller clouds of roughly 102400 points each, located under data/semantic3d/split/training/. If you are curious how we handle the splitting, take a look at Partition a large dataset; a rough sketch of the idea follows below.
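
For illustration only, one way to implement such a split is to shuffle the point indices and cut them into roughly equal chunks. The chunk size default and the random permutation below are assumptions made for this sketch; the project's actual logic is described in Partition a large dataset.

import numpy as np

def split_cloud(points: np.ndarray, chunk_size: int = 102_400) -> list[np.ndarray]:
    """Split an (N, D) point array into chunks of roughly chunk_size points."""
    # shuffle once so every chunk is a random subset of the full cloud
    indices = np.random.permutation(len(points))
    n_chunks = max(1, len(points) // chunk_size)
    return [points[chunk] for chunk in np.array_split(indices, n_chunks)]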

The points belong to 9 different classes, where 0 corresponds to unlabeled.

deepoints/src/deepoints/datasets/semantic3d/base.py
@staticmethod
def classes() -> list[str]:
    return [
        'unlabeled',
        'man-made terrain',
        'natural terrain',
        'high vegetation',
        'low vegetation',
        'buildings',
        'hard scape',
        'scanning artefacts',
        'cars'
    ]
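
Assuming the Semantic3D class used below inherits this staticmethod, integer labels translate to names by plain indexing, for example:

import numpy as np

from deepoints.datasets.semantic3d import Semantic3D

names = np.array(Semantic3D.classes())
labels = np.array([0, 5, 8])  # labels as stored in the point clouds
print(names[labels])          # ['unlabeled' 'buildings' 'cars']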

Usage

Using the dataset is dead simple with the lightning.LightningDataModule-based interface.

from deepoints.datasets.semantic3d import Semantic3D
import lightning

# define a trainer
trainer = lightning.Trainer(...)
# any segmentation model you like
model = ...
# declare a Semantic3D datamodule that randomly samples
# 8 point cloud files per batch and 4096 points per file
datamodule = Semantic3D(sample_size=4096, batch_size=8)
# train your model on the dataset
trainer.fit(model, datamodule=datamodule)
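
Since Semantic3D is a LightningDataModule, the same object should also work with trainer.validate(model, datamodule=datamodule) and trainer.test(model, datamodule=datamodule), assuming dataloaders for the validation and testing splits shown above are implemented.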

References

Here is a reference to the original paper that introduced the dataset.

@article{semantic3d,
  author       = {Timo Hackel and
                  Nikolay Savinov and
                  Lubor Ladicky and
                  Jan Dirk Wegner and
                  Konrad Schindler and
                  Marc Pollefeys},
  title        = {Semantic3D.net: {A} new Large-scale Point Cloud Classification Benchmark},
  journal      = {CoRR},
  volume       = {abs/1704.03847},
  year         = {2017}
}