So You Want Git for Data

Aspects of "Git" that you might want for data

  • Version Control
    • rollbacks
    • diffs
    • lineage
    • branch-merge
    • sharing
    • addressability
    • multiple remotes
    • staging area
  • Data Catalog
    • thriving open data community
    • collaborate remotely and asynchronously
    • pull requests
    • create issues referring to certain parts of the data
  • Types
  • Transformation can happen externally
  • Data labelling can happen as statement metadata?
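
To make a few of the version-control bullets above concrete (addressability, diffs, lineage, rollbacks), here is a minimal sketch in Python. It is an illustration only, not how any particular product works: `TinyDataRepo`, `address`, and the dict-of-records layout are hypothetical names and assumptions made for this note, and everything lives in memory.

```python
import hashlib
import json


def address(rows):
    """Content address: SHA-256 of a canonical JSON encoding of the rows."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TinyDataRepo:
    """Hypothetical in-memory store of snapshots keyed by content address."""

    def __init__(self):
        self.snapshots = {}  # address -> rows (dict of key -> record)
        self.parents = {}    # address -> parent address (lineage)
        self.head = None

    def commit(self, rows):
        """Store a snapshot and move head to it; returns its address."""
        addr = address(rows)
        if addr not in self.snapshots:
            # Round-trip through JSON to keep an independent copy.
            self.snapshots[addr] = json.loads(json.dumps(rows))
            self.parents[addr] = self.head
        self.head = addr
        return addr

    def rollback(self, addr):
        """Point head back at an earlier snapshot."""
        if addr not in self.snapshots:
            raise KeyError(addr)
        self.head = addr

    def diff(self, old, new):
        """Row-level diff between two snapshots, by record key."""
        a, b = self.snapshots[old], self.snapshots[new]
        return {
            "added": sorted(b.keys() - a.keys()),
            "removed": sorted(a.keys() - b.keys()),
            "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
        }

    def lineage(self, addr):
        """Walk parent links from a snapshot back to the first commit."""
        chain = []
        while addr is not None:
            chain.append(addr)
            addr = self.parents[addr]
        return chain


repo = TinyDataRepo()
v1 = repo.commit({"1": {"name": "ada"}, "2": {"name": "bob"}})
v2 = repo.commit({"1": {"name": "ada"}, "2": {"name": "robert"}, "3": {"name": "cy"}})
changes = repo.diff(v1, v2)  # record "2" changed, record "3" added
history = repo.lineage(v2)   # [v2, v1]
repo.rollback(v1)            # head points at the first snapshot again
```

Because the address is derived from the content, committing identical data twice yields the same address. Branch-merge, remotes, and a staging area are deliberately left out of this sketch.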

Three types of solutions

The products fell into three general categories:

1. [[Data catalogs|t.cs.data.catalog]]
2. Data pipeline versioning
3. Versioned databases