Data Science

This page contains resources about Data Science, Data Engineering and Data Management.

Subfields and Concepts

 * Agile Data Science
 * Machine Learning / Data Mining
 * Exploratory Data Analysis (EDA)
 * Data Preparation and Data Preprocessing
 * Data Fusion and Data Integration
 * Data Wrangling / Data Munging
 * Data Scraping
 * Data Sampling
 * Data Cleaning
 * Data Visualization
 * Explainable AI (XAI) / Interpretable AI
 * Big Data
 * Data Engineering, Data Management and Databases
 * High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
 * Concurrent/Multi-threading Computing for Machine Learning
 * Synchronous Communication (for Web Services)
    * Representational State Transfer (REST) Protocol
    * Remote Procedure Call (RPC)
    * Simple Object Access Protocol (SOAP)
 * Asynchronous Communication / Asynchronous Messaging (for Web Services)
    * Message broker/Message bus/Event bus/Integration broker/Interface engine
    * Message queue
    * Asynchronous protocols
       * Advanced Message Queuing Protocol (AMQP)
       * MQ Telemetry Transport (MQTT)
    * Messaging patterns
       * Fire-and-Forget / One-Way
       * Request-Response / Request-Reply
       * Publisher-Subscriber
       * Request-Callback
 * Software Architecture
    * Monolithic Architecture
    * Microservices Architecture
    * Service-Oriented Architecture (SOA)
 * Stream Processing

Video Lectures

 * How to Win a Data Science Competition: Learn from Top Kagglers - Coursera

Lecture Notes

 * Data Science by Ioannis Kourouklides
 * When to Use and When Not to Use Distributed Machine Learning by Chih-Jen Lin
 * Open Machine Learning Course (Medium)
 * Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman
 * Hardware Acceleration for Data Processing by Gustavo Alonso
 * CS109: Data Science

Books

 * Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. 2nd Ed. O'Reilly Media.
 * Bellemare, A. (2020). Building Event-Driven Microservices: Leveraging Organizational Data at Scale. O'Reilly Media.
 * Richards, M., & Ford, N. (2020). Fundamentals of Software Architecture. O'Reilly Media.
 * Dean, A., & Crettaz, V. (2019). Event Streams in Action. Manning.
 * Richardson, C. (2018). Microservices Patterns. Manning Publications.
 * Pacheco, V. F. (2018). Microservice Patterns and Best Practices. Packt Publishing.
 * De la Torre, C., Wagner, B., & Rousos, M. (2018). .NET Microservices: Architecture for Containerized .NET Applications. Microsoft Corporation. (link)
 * Lanaro, G. (2017). Python High Performance. Packt Publishing.
 * Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
 * Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media.
 * VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
 * Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing.
 * Dunning, T., & Friedman, E. (2016). Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media.
 * Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
 * Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
 * Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
 * Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing.
 * Kale, V. (2015). Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications. CRC Press.
 * Ejsmont, A. (2015). Web Scalability for Startup Engineers. McGraw Hill.
 * Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (link)
 * Zumel, N., Mount, J., & Porzak, J. (2014). Practical Data Science with R. Manning.
 * Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media.
 * Videla, A., & Williams, J. J. W. (2012). RabbitMQ in Action. Manning.
 * Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Scholarly Articles

 * Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
 * Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
 * Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
 * Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
 * Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
 * Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
 * Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
 * Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
 * Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
 * Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
 * Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
 * Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software

 * Docker (Containers)
 * Anaconda Distribution - Python
 * Cython - Python
 * Beautiful Soup 4 - Python
 * lxml - Python
 * Selenium - Python
 * Scrapy - Python
 * ray - Python
 * multiprocessing - Python
 * threading - Python
 * auto_ml - Python
 * Celery - Python
 * Elasticsearch, Logstash, Kibana (ELK)
 * MongoDB
 * Apache Solr
 * Apache Hadoop
 * Apache HBase
 * Apache Spark
 * Apache Hive
 * Apache Cassandra
 * Apache ZooKeeper
 * Apache Pig
 * Apache Storm
 * Apache CouchDB
 * Apache ActiveMQ
 * Apache Samza
 * Apache Flink
 * Apache Kafka (which includes Kafka Connect) - A message broker
 * RabbitMQ - A message broker
 * Redis - An in-memory data store that can also serve as a message broker
 * pyspark - Spark Python API
 * tensorflow_scala - Scala API for TensorFlow
 * TensorFlowSharp - TensorFlow API for .NET languages
 * TensorFlowOnSpark - It brings TensorFlow programs onto Apache Spark clusters
 * Numba - Python
 * GraphQL
 * nginx
 * DVC - Data Version Control
 * Kubeflow
 * Akka
 * Pykka
 * Heron
 * Apache Airflow - Workflow Management System
 * Druid
 * Apache Superset
 * Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet
 * Acumos AI
 * HopsML
 * Apache Arrow

General

 * What is Data Science by Ioannis Kourouklides - slides
 * Data Science Guide
 * Data Science Engineering, your way
 * A manifesto for Agile data science - blog post
 * Data Science Project Flow for Startups - blog post
 * Large Scale Machine Learning - libraries and papers
 * What are some courses on large scale learning? - Quora
 * 7 Steps to Mastering Data Preparation with Python - blog post
 * Web Scraping for Data Science with Python - blog post
 * Princeton Commodities Modeling Blog
 * Python-camp - GitHub
 * Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ... - installation instructions
 * Deep Learning with Apache Spark and TensorFlow - blog post
 * Build a Simple Chatbot with Tensorflow, Python and MongoDB - blog post
 * Plotly Python Library Maps
 * 5 Quick and Easy Data Visualizations in Python with Code - blog post
 * William Koehrsen - blog
 * ClaoudML - Free Data Science & Machine Learning Resources
 * Data Science in Python: Pandas Cheat Sheet
 * Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch - webinar
 * Federated Learning: Collaborative Machine Learning without Centralized Training Data - blog post
 * Snap ML - IBM
 * pyspark (GitHub) - collection of resources
 * Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data - blog post
 * Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads - blog post
 * Rules of Machine Learning: Best Practices for ML Engineering - blog post
 * Google Compute Engine Now Has Images With PyTorch 1.0.0 and FastAi 1.0.2 - blog post
 * Michelangelo PyML: Introducing Uber’s Platform for Rapid Python ML Model Development
 * Manage your Data Science project structure in early stage - blog post
 * Cookiecutter Data Science — Organize your Projects — Atom and Jupyter - blog post
 * surreal (GitHub) - code
 * cloudwise (GitHub) - code
 * caraml (GitHub) - code
 * symphony (GitHub) - code
 * TensorFlow Vs. Spark: How Do They Differ And Work In Tandem With Each Other - blog post
 * awesome-datascience (GitHub)
 * awesome-learn-datascience (GitHub)
 * When Deep Learning with GPUs, use a Cluster Manager - blog post

Data Annotation & Labelling

 * What is Data Annotation?
 * Amazon Mechanical Turk
 * CloudFactory
 * Appen
 * Alegion
 * iMerit
 * Playment
 * Rev - Transcription from video and audio
 * Labelbox
 * diffgram
 * Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
 * Data Annotation Tools for Machine Learning (Evolving Guide)
 * awesome-data-annotation (GitHub)

EDA

 * Exploratory data analysis using Python for used car database taken from Kaggle - GitHub
 * Detailed exploratory data analysis with Python - Kaggle
 * Exploratory data analysis in Python - PyCon 2017 (YouTube)
 * Exploratory Data Analysis with Pandas - blog post
 * Simple Exploratory Data Analysis - PASSNYC - Kaggle
 * EDA and Clustering - Kaggle
 * Introduction to Exploratory Data Analysis in Python - blog post
 * Visual Data Analysis with Python - blog post

Asynchronous Communication & Microservices

 * Pattern: Microservice Architecture
 * Software Architecture Patterns and Designs
 * Asynchronous communication with message queue
 * Message Exchange Patterns (MEPs)
 * Message exchange patterns
 * Messaging patterns
 * Synchronous vs Asynchronous communication in microservices integration
 * Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?
 * Akka Streams and Kafka Streams: Where Microservices Meet Fast Data
 * Akka, Spark, or Kafka? Selecting the Right Streaming Engine
 * Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems
 * kafka-streams-machine-learning-examples (GitHub) - Machine Learning + Kafka Streams Examples (with code)
 * Realtime Machine Learning predictions with Kafka and H2O.ai - blog post
 * Understanding When to use RabbitMQ or Apache Kafka
 * What is Stream Processing?
 * A Gentle Introduction to Stream Processing

Distributed Systems

 * Docker Distributed Systems Summit - videos and podcast episodes
 * Using Docker to Simplify Distributed Systems in Development - video
 * Day 11: Using Docker to build and deploy a distributed app - blog post with code
 * Intro to Distributed Deep Learning Systems - blog post
 * Parallel and Distributed Deep Learning by Tal Ben-Nun
 * An introduction to parallel programming using Python's multiprocessing module - blog post
 * Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet - blog post
 * Paper Review. Petuum: A new platform for distributed machine learning on big data - blog post
 * A comparison of distributed machine learning platforms - blog post
 * Distributed Filesystems for Deep Learning - blog post
 * Distributed-TensorFlow-Guide (GitHub) - Distributed TensorFlow basics and examples of training algorithms (with code)

Deployment and Production

 * How Docker Can Help You Become A More Effective Data Scientist - blog post
 * How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka - blog post
 * Deploying deep learning models: Part 1 an overview - blog post
 * A guide to deploying Machine/Deep Learning model(s) in Production - blog post
 * How redBus uses Scikit-Learn ML models to classify customer complaints? - blog post
 * Deploying a Keras Deep Learning Model as a Web Application in Python - blog post
 * Awesome-docker - A curated list of Docker resources and projects
 * Awesome-Kubernetes - A curated list for awesome kubernetes sources
 * Michael Herman - Going Serverless with OpenFaaS, Kubernetes, and Python - PyCon 2018 (YouTube)
 * Aly Sivji, Joe Jasinski, tathagata dasgupta (t) - Docker for Data Science - PyCon 2018 (YouTube)
 * Ruben Orduz, Nolan Brubaker - A Python-flavored Introduction to Containers And Kubernetes - PyCon 2018 (YouTube)
 * Miguel Grinberg - Microservices with Python and Flask - PyCon 2017 (YouTube)
 * Deploy and scale containers with Docker native, open source orchestration - PyCon 2017 (YouTube)
 * Miguel Grinberg - Flask at Scale - PyCon 2016 (YouTube)
 * Deploying and scaling applications with Docker, Swarm, and a tiny bit of Python magic - PyCon 2016 (YouTube)
 * Jérôme Petazzoni - Introduction to Docker and containers - PyCon 2016 (YouTube)
 * Miguel Grinberg - Flask Workshop - PyCon 2015 (YouTube)
 * Andrew T. Baker - Docker 101: Introduction to Docker - PyCon 2015 (YouTube)
 * Miguel Grinberg - Flask by Example - PyCon 2014 (YouTube)
 * Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users - blog post
 * Deploying Keras Deep Learning Models with Flask - blog post
 * Introducing Flask-RESTful - blog post
 * Develop a NLP Model in Python & Deploy It with Flask, Step by Step - blog post
 * Deploying Machine Learning apps with Docker containers - MUPy 2017 - video
 * Getting started with Anaconda & Docker - blog post
 * Docker for Data Science - blog post
 * Simplified Docker-ing for Data Science — Part 1 - blog post
 * Deep Learning Installation Tutorial - Part 4: How to install Docker for Deep Learning - blog post
 * How to write a production-level code in Data Science? - blog post
 * Web Access Logs in Elasticsearch and Machine Learning - webinar
 * Deploying Python models to production - video
 * How to deploy machine learning models into production - video
 * Putting Machine Learning Models into Production - blog post
 * productionML (GitHub) - code for creating Production level API services for Machine Learning
 * AI Tales: Building Machine learning pipeline using Kubeflow and Minio - blog post
 * Deep-Learning-in-Production (GitHub)
 * Deploy your AI model the hard (and robust) way - blog post