nsudhanva@gmail.com. +91-9632350260

This post collects guidelines, best practices and tools for developers. It applies to Data Analysts, Data Engineers, Machine Learning Engineers, Data Scientists, and research teams in general.


Git and GitHub

  • If you're new to Git and GitHub/GitLab, watch this course from Udacity
  • Ensure your Git client is configured with the correct email address and linked to your GitHub/GitLab user
  • Use Git-based repositories, and push all code to the company's GitLab (or GitHub). Ask your manager for access to your respective groups so that you can create repositories and push your code
  • Don't push your code directly to the master branch. Work on branches and use tags to mark releases.
  • Always send a Pull Request (Merge Request) to your senior developer. If you're working alone, send it to yourself.
  • Install and use GitHub Desktop for better code management and visibility. Install GitHub CLI and Hub CLI if you prefer the command line
  • Read more here

.gitignore

  • Be sure to ignore trivial files and dependencies
  • Ignore large files such as images and caches, and sensitive files such as private keys
  • If you're not sure what to ignore, use gitignore.io to generate a .gitignore file

Commit Messages

You're not expected to follow everything in the links below; the aim is to develop a habit of writing good commit messages

Secret Keys

  • Never, ever commit any of the API Keys, Secret Keys, Tokens, URLs or Passwords in any of the files.
  • Read more here and here
  • Use .env files and read the keys from environment variables. How you do this depends on the language and tools you use, e.g. Python, Node or Docker
  • Exclude the .env file from commits by adding .env to .gitignore. You can also commit an example configuration, .env.sample, with dummy data or blanks to show the schema your application requires
  • If you commit a secret key by mistake, notify your senior developer or manager as soon as possible. Read more on the removal of sensitive data here
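
To make the .env advice concrete, here is a minimal Python sketch that uses only the standard library. The helper names load_env_file and get_secret, and the variable name EXAMPLE_API_KEY in the usage note, are hypothetical and chosen only for illustration. The parser handles plain KEY=VALUE lines and nothing more.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Load KEY=VALUE pairs from a .env-style file into os.environ.

    Variables already set in the real environment are not overwritten.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def get_secret(name):
    """Return a required secret, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

A script would call load_env_file() once at startup and then read values with get_secret("EXAMPLE_API_KEY"). In practice a maintained package such as python-dotenv does the same job more robustly; the point here is only that the secret lives outside the committed code.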

README.md

  • Be sure to include a README.md file in every repository you create
  • Find best practices here and try to incorporate whichever suits your work

Githooks

Git hooks are scripts that Git executes before or after events such as commit, push, and receive. Check out Githooks
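
As one possible illustration, a pre-commit hook can be written in Python and used to catch secrets before they are committed. The sketch below is an assumption about how such a hook might look, not a recommended scanner: the two patterns are deliberately simple, and the function names are hypothetical.

```python
#!/usr/bin/env python3
import re
import subprocess
import sys

# Illustrative patterns only; real secret scanners use far larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
]


def find_secrets(text):
    """Return the lines of text that match any secret-like pattern."""
    return [
        line
        for line in text.splitlines()
        if any(pattern.search(line) for pattern in SECRET_PATTERNS)
    ]


def staged_diff():
    """Return the diff of the changes staged for the next commit."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main():
    """Return 1 (which blocks the commit) if staged additions look like secrets."""
    # Only inspect added lines, ignoring the "+++ b/file" diff headers.
    added = "\n".join(
        line[1:] for line in staged_diff().splitlines()
        if line.startswith("+") and not line.startswith("+++")
    )
    hits = find_secrets(added)
    for hit in hits:
        print(f"Possible secret in staged changes: {hit.strip()}", file=sys.stderr)
    return 1 if hits else 0

# When installed as a hook, end the file with: sys.exit(main())
```

To try it, save the script as .git/hooks/pre-commit with the final sys.exit(main()) line added, and make it executable. Git aborts the commit whenever the hook exits with a non-zero status.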

Data

  • Use Data Version Control. DVC usually runs along with Git. Git is used as usual to store and version code (including DVC meta-files). DVC helps to store data and model files seamlessly out of Git, while preserving almost the same user experience as if they were stored in Git itself
  • Read more at their site and here

ML

Notebooks

You should write notebooks in such a way that anyone can rerun them on the same inputs and produce the same outputs. A notebook should be executable from top to bottom and should contain the information required to set up a correct, consistent environment. Create templates for common tasks so that other team members can reuse them. Also use JupyterLab instead of the classic Jupyter Notebook interface. Avoid using Google Colab unless it's absolutely necessary.
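
One small step towards reproducibility is a first cell that fixes the random seed and records the environment the notebook ran in. The sketch below uses only the standard library; the seed value and the function names are arbitrary choices for illustration. Libraries that keep their own random generators need to be seeded separately.

```python
import json
import platform
import random
import sys

SEED = 42  # hypothetical fixed seed, any constant works


def set_seed(seed=SEED):
    """Seed Python's built-in random module so reruns give the same numbers."""
    random.seed(seed)
    return seed


def environment_summary():
    """Collect basic facts about the interpreter and operating system."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


# Typical first cell: seed once, then print the environment for the record.
set_seed()
print(json.dumps(environment_summary(), indent=2))
```

Keeping this output in the saved notebook gives a reader a starting point when results differ between machines.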

Summary

  • Follow established software development best practices: OOP, style guides, documentation
  • You should institute version control for your Notebooks
  • Reproducible Notebooks
  • Continuous Integration (CI)
  • Parameterized Notebooks
  • Continuous Deployment (CD)
  • Log all experiments automatically

Notebook guidelines

  • Organizing your code: Write classes and modules in separate files and import them into your notebooks. Keep your notebooks clean and short
  • Variables: Create new variables instead of overwriting existing ones. Do not hard-code numerical constants, URL strings etc. Define them as Python module-level constants instead
  • TDD: Write test cases for your modules. Read first here and then here
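
The guidelines above can be sketched in a few lines. In a real project the function would live in its own module (say, a hypothetical data_utils.py) with the tests in a separate file; they are shown together here only to keep the example self-contained. The constant TEST_FRACTION and the function split_rows are made up for illustration.

```python
import unittest

# A module-level constant replaces a magic number scattered through the notebook.
TEST_FRACTION = 0.2  # hypothetical share of rows held out for testing


def split_rows(rows, test_fraction=TEST_FRACTION):
    """Split rows into (train, test) without shuffling."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    cut = len(rows) - round(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


class SplitRowsTest(unittest.TestCase):
    def test_default_fraction(self):
        train, test = split_rows(list(range(10)))
        self.assertEqual(len(train), 8)
        self.assertEqual(len(test), 2)

    def test_rejects_bad_fraction(self):
        with self.assertRaises(ValueError):
            split_rows([1, 2, 3], test_fraction=1.5)
```

The notebook then only needs to import split_rows, and the tests can be run from the command line with python -m unittest.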

Tracking Experiments

ML Ops
