Difference between revisions of "Data Science"

From Ioannis Kourouklides
 
(113 intermediate revisions by 3 users not shown)
Line 1: Line 1:
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], including '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
+
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
   
 
== Subfields and Concepts ==
 
== Subfields and Concepts ==
  +
* Agile Data Science
 
* [[Machine Learning]] / Data Mining
 
* [[Machine Learning]] / Data Mining
* Exploratory Data Analysis
+
* Exploratory Data Analysis (EDA)
 
* Data Preparation and Data Preprocessing
 
* Data Preparation and Data Preprocessing
 
* Data Fusion and Data Integration
 
* Data Fusion and Data Integration
Line 10: Line 11:
 
* Data Sampling
 
* Data Sampling
 
* Data Cleaning
 
* Data Cleaning
* High Performance/Parallel/Distributed Computing for Machine Learning
 
* Concurrent/Multi-threading Computing for Machine Learning
 
* Data Engineering, Data Management and Databases
 
 
* Data Visualization
 
* Data Visualization
  +
* Explainable AI (XAI) / Interpretable AI
 
* Big Data
 
* Big Data
  +
* Data Engineering, Data Management and Databases
  +
* High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  +
* Concurrent/Multi-threading Computing for Machine Learning
  +
* Synchronous Communication (for Web Services)
  +
** Representational State Transfer (REST) Protocol
  +
** Remote Procedure Call (RPC)
  +
** Simple Object Access Protocol (SOAP)
  +
* Asynchronous Communication / Asynchronous Messaging (for Web Services)
  +
** Message broker/Message bus/Event bus/Integration broker/Interface engine
  +
** Message queue
  +
** Asynchronous protocols
  +
*** Advanced Message Queuing Protocol (AMQP)
  +
*** MQ Telemetry Transport (MQTT)
  +
* Messaging patterns
  +
** Fire-and-Forget / One-Way
  +
** Request-Response / Request-Reply
  +
** Publisher-Subscriber
  +
** Request-Callback
  +
* Software Architecture
  +
** Monolithic Architecture
  +
** Microservices Architecture
  +
** Service-Oriented Architecture (SOA)
  +
* Stream Processing
   
 
== Online courses ==
 
== Online courses ==
Line 20: Line 42:
 
=== Video Lectures ===
 
=== Video Lectures ===
 
* [https://www.coursera.org/learn/competitive-data-science How to Win a Data Science Competition: Learn from Top Kagglers] - Coursera
 
* [https://www.coursera.org/learn/competitive-data-science How to Win a Data Science Competition: Learn from Top Kagglers] - Coursera
* [https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python by Chloe Mawer and Jonathan Whitmore] - PyCon 2017
 
   
 
=== Lecture Notes ===
 
=== Lecture Notes ===
* [https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides]
+
* [https://goo.gl/VSTGUQ Data Science by Ioannis Kourouklides]
 
* [https://www.csie.ntu.edu.tw/~cjlin/talks/bigdata-bilbao.pdf When <nowiki> [to use] </nowiki> and When Not to Use Distributed Machine Learning by Chih-Jen Lin]
 
* [https://www.csie.ntu.edu.tw/~cjlin/talks/bigdata-bilbao.pdf When <nowiki> [to use] </nowiki> and When Not to Use Distributed Machine Learning by Chih-Jen Lin]
 
* [https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Open Machine Learning Course] (Medium)
 
* [https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Open Machine Learning Course] (Medium)
Line 31: Line 52:
   
 
==Books==
 
==Books==
  +
* Newman, S. (2021). ''Building Microservices: Designing Fine-Grained Systems''. 2nd Ed. O'Reilly Media.
  +
* Bellemare, A. (2020). ''Building Event-Driven Microservices: Leveraging Organizational Data at Scale''. O'Reilly Media.
  +
* Richards, M. (2020). ''Fundamentals of Software Architecture''. O'Reilly Media.
  +
* Dean, A., & Crettaz, V. (2019). ''Event Streams in Action''. Manning.
  +
* Richardson, C. (2018). ''Microservices Patterns''. Manning Publications.
  +
* Pacheco, V. F. (2018). ''Microservice Patterns and Best Practices''. Packt Publishing.
  +
* De la Torre, C., Wagner, B., & Rousos, M. (2018). ''.NET Microservices: Architecture for Containerized .NET Applications''. Microsoft Corporation. ([https://github.com/dzfweb/microsoft-microservices-book link])
  +
* Lanaro, G. (2017). ''Python High Performance''. Packt Publishing.
 
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
 
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
  +
* Kleppmann, M. (2017). ''Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems''. O'Reilly Media.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
  +
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing.
  +
* Dunning, T., & Friedman, E. (2016). ''Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.'' O'Reilly Media.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing Ltd.
+
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing.
  +
* Kale, V. (2015). ''Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications''. CRC Press.
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press. ([http://www.mmds.org/ link])
 
  +
* Ejsmont, A. (2015). ''Web Scalability for Startup Engineers''. McGraw Hill.
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical data science with R''. Manning.
 
  +
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). ''Mining of Massive Datasets''. Cambridge University Press. ([http://www.mmds.org/ link])
* Schutt, R., & O'Neil, C. (2013). ''Doing data science: Straight talk from the frontline''. O'Reilly Media.
 
* Tukey, J. W. (1977). ''Exploratory data analysis''. Addison-Wesley.
+
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical Data Science with R''. Manning.
  +
* Schutt, R., & O'Neil, C. (2013). ''Doing Data Science: Straight Talk from the Frontline''. O'Reilly Media.
  +
* Videla, A., & Williams, J. J. W. (2012). ''RabbitMQ in Action''. Manning.
  +
* Tukey, J. W. (1977). ''Exploratory Data Analysis''. Addison-Wesley.
   
 
==Scholarly Articles==
 
==Scholarly Articles==
  +
* Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ''ACM Computing Surveys (CSUR), 53''(2), 1-33.
  +
* Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. ''arXiv preprint arXiv:1902.00465.''
 
* Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. ''Proceedings of the VLDB Endowment, 10''(11), 1586-1597.
 
* Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. ''Proceedings of the VLDB Endowment, 10''(11), 1586-1597.
  +
* Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In ''Proceedings of the 22nd [https://en.wikipedia.org/wiki/SIGKDD ACM SIGKDD International Conference on Knowledge Discovery and Data Mining]'' (pp. 1135-1144).
 
* Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. ''Engineering, 2''(2), 179-195.
 
* Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. ''Engineering, 2''(2), 179-195.
 
* Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. ''International Journal of Data Science and Analytics, 1''(3-4), 145-164.
 
* Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. ''International Journal of Data Science and Analytics, 1''(3-4), 145-164.
 
* Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In ''[https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems Advances in Neural Information Processing Systems]'' (pp. 2503-2511).
 
* Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In ''[https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems Advances in Neural Information Processing Systems]'' (pp. 2503-2511).
* Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In ''Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data'' (pp. 607-618). ACM.
+
* Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In ''Proceedings of the 2015 [https://en.wikipedia.org/wiki/SIGMOD ACM SIGMOD International Conference on Management of Data]'' (pp. 607-618).
 
* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. ''arXiv preprint arXiv:1511.06051''.
 
* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. ''arXiv preprint arXiv:1511.06051''.
 
* Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. ''Journal of Parallel and Distributed Computing, 73''(3), 284-292.
 
* Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. ''Journal of Parallel and Distributed Computing, 73''(3), 284-292.
Line 65: Line 103:
 
* [https://docs.python.org/3.4/library/threading.html threading] - Python
 
* [https://docs.python.org/3.4/library/threading.html threading] - Python
 
* [https://github.com/ClimbsRocks/auto_ml auto_ml] - Python
 
* [https://github.com/ClimbsRocks/auto_ml auto_ml] - Python
  +
* [https://docs.celeryproject.org/en/stable/getting-started/introduction.html Celery] - Python
 
* [https://www.elastic.co/products/elasticsearch Elasticsearch], [https://www.elastic.co/products/logstash Logstash], [https://www.elastic.co/products/kibana Kibana] (ELK)
 
* [https://www.elastic.co/products/elasticsearch Elasticsearch], [https://www.elastic.co/products/logstash Logstash], [https://www.elastic.co/products/kibana Kibana] (ELK)
 
* [https://www.mongodb.com/ MongoDB]
 
* [https://www.mongodb.com/ MongoDB]
Line 72: Line 111:
 
* [https://spark.apache.org/ Apache Spark]
 
* [https://spark.apache.org/ Apache Spark]
 
* [https://hive.apache.org/ Apache Hive]
 
* [https://hive.apache.org/ Apache Hive]
* [http://kafka.apache.org/ Apache Kafka], which includes [https://www.confluent.io/product/connectors/ Kafka Connect]
 
 
* [http://cassandra.apache.org/ Apache Cassandra]
 
* [http://cassandra.apache.org/ Apache Cassandra]
 
* [https://zookeeper.apache.org/ Apache ZooKeeper]
 
* [https://zookeeper.apache.org/ Apache ZooKeeper]
Line 79: Line 117:
 
* [http://couchdb.apache.org/ Apache CouchDB]
 
* [http://couchdb.apache.org/ Apache CouchDB]
 
* [http://activemq.apache.org/ Apache ActiveMQ]
 
* [http://activemq.apache.org/ Apache ActiveMQ]
* [https://www.rabbitmq.com/ RabbitMQ]
+
* [http://samza.apache.org/ Apache Samza]
  +
* [https://flink.apache.org/ Apache Flink]
  +
* [http://kafka.apache.org/ Apache Kafka] (which includes [https://www.confluent.io/product/connectors/ Kafka Connect]) - A message broker
  +
* [https://www.rabbitmq.com/ RabbitMQ] - A message broker
  +
* [https://redis.io/ Redis] - A message broker
 
* [https://spark.apache.org/docs/latest/api/python/index.html pyspark] - Spark Python API
 
* [https://spark.apache.org/docs/latest/api/python/index.html pyspark] - Spark Python API
 
* [http://platanios.org/tensorflow_scala/ tensorflow_scala] - Scala API for TensorFlow
 
* [http://platanios.org/tensorflow_scala/ tensorflow_scala] - Scala API for TensorFlow
Line 85: Line 127:
 
* [https://github.com/yahoo/TensorFlowOnSpark TensorFlowOnSpark] - It brings TensorFlow programs onto Apache Spark clusters
 
* [https://github.com/yahoo/TensorFlowOnSpark TensorFlowOnSpark] - It brings TensorFlow programs onto Apache Spark clusters
 
* [https://numba.pydata.org/ Numba] - Python
 
* [https://numba.pydata.org/ Numba] - Python
  +
* [https://graphql.org/ GraphQL]
  +
* [https://www.nginx.com/ nginx]
  +
* [https://dvc.org/ DVC] - Data Version Control
  +
* [https://www.kubeflow.org/ kubeflow]
  +
* [https://akka.io/ Akka]
  +
* [https://www.pykka.org/ Pykka]
  +
* [https://apache.github.io/incubator-heron/ Heron]
  +
* [https://airflow.apache.org/ Apache Airflow] - Workflow Management System
  +
* [http://druid.io/ Druid]
  +
* [https://superset.incubator.apache.org/druid.html Apache Superset]
  +
* [https://github.com/horovod/horovod Horovod] - TensorFlow, Keras, PyTorch, and MXNet
  +
* [https://www.acumos.org/ Acumos AI]
  +
* [https://hopsworks.readthedocs.io/en/0.9/hopsml/hopsML.html HopsML]
  +
* [https://arrow.apache.org/ Apache Arrow]
   
 
==See also==
 
==See also==
Line 91: Line 147:
 
==Other Resources==
 
==Other Resources==
 
===General===
 
===General===
  +
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides] - slides
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
 
*[https://www.oreilly.com/ideas/a-manifesto-for-agile-data-science A manifesto for Agile data science] - blog post
 
*[https://www.oreilly.com/ideas/a-manifesto-for-agile-data-science A manifesto for Agile data science] - blog post
  +
*[https://towardsdatascience.com/data-science-project-flow-for-startups-282a93d4508d Data Science Project Flow for Startups] - blog post
 
*[http://www.cse.ust.hk/~kxmo/LargeML.html Large Scale Machine Learning] - libraries and papers
 
*[http://www.cse.ust.hk/~kxmo/LargeML.html Large Scale Machine Learning] - libraries and papers
 
*[https://www.quora.com/What-are-some-courses-on-large-scale-learning What are some courses on large scale learning?] - Quora
 
*[https://www.quora.com/What-are-some-courses-on-large-scale-learning What are some courses on large scale learning?] - Quora
Line 99: Line 157:
 
*[https://www.kdnuggets.com/2017/12/baesens-web-scraping-data-science-python.html Web Scraping for Data Science with Python] - blog post
 
*[https://www.kdnuggets.com/2017/12/baesens-web-scraping-data-science-python.html Web Scraping for Data Science with Python] - blog post
 
*[http://vlad17.github.io/COS513-Blog/ Princeton Commodities Modeling Blog]
 
*[http://vlad17.github.io/COS513-Blog/ Princeton Commodities Modeling Blog]
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - Github
 
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
 
 
*[https://github.com/upalr/Python-camp Python-camp] - GitHub
 
*[https://github.com/upalr/Python-camp Python-camp] - GitHub
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
 
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
 
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
 
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
 
 
*[http://mtitek.com/big-data.php Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ...] - installation instructions
 
*[http://mtitek.com/big-data.php Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ...] - installation instructions
 
*[https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Deep Learning with Apache Spark and TensorFlow] - blog post
 
*[https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Deep Learning with Apache Spark and TensorFlow] - blog post
 
*[https://khartig.wordpress.com/2017/12/30/build-a-simple-chatbot-with-tensorflow-python-and-mongodb/ Build a Simple Chatbot with Tensorflow, Python and MongoDB] - blog post
 
*[https://khartig.wordpress.com/2017/12/30/build-a-simple-chatbot-with-tensorflow-python-and-mongodb/ Build a Simple Chatbot with Tensorflow, Python and MongoDB] - blog post
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
 
 
*[https://plot.ly/python/maps/ Plotly Python Library Maps]
 
*[https://plot.ly/python/maps/ Plotly Python Library Maps]
 
*[https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f 5 Quick and Easy Data Visualizations in Python with Code] - blog post
 
*[https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f 5 Quick and Easy Data Visualizations in Python with Code] - blog post
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
 
*[http://www.claoudml.co/ ClaoudML] - Free Data Science & Machine Learning Resources
 
*[http://www.claoudml.co/ ClaoudML] - Free Data Science & Machine Learning Resources
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
 
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
 
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
 
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
 
 
*[https://www.datasciencecentral.com/profiles/blogs/data-science-in-python-pandas-cheat-sheet Data Science in Python: Pandas Cheat Sheet]
 
*[https://www.datasciencecentral.com/profiles/blogs/data-science-in-python-pandas-cheat-sheet Data Science in Python: Pandas Cheat Sheet]
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
  +
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
 
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
 
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
* [https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
 
 
*[https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463 Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data] - blog post
 
*[https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463 Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data] - blog post
 
*[https://eng.uber.com/peloton/ Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads] - blog post
 
*[https://eng.uber.com/peloton/ Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads] - blog post
 
*[http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf Rules of Machine Learning: Best Practices for ML Engineering] - blog post
 
*[http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf Rules of Machine Learning: Best Practices for ML Engineering] - blog post
 
*[https://blog.kovalevskyi.com/google-compute-engine-now-has-images-with-pytorch-1-0-0-and-fastai-1-0-2-57c49efd74bb Google Compute Engine Now Has Images With PyTorch 1.0.0 and FastAi 1.0.2] - blog post
 
*[https://blog.kovalevskyi.com/google-compute-engine-now-has-images-with-pytorch-1-0-0-and-fastai-1-0-2-57c49efd74bb Google Compute Engine Now Has Images With PyTorch 1.0.0 and FastAi 1.0.2] - blog post
*[https://eng.uber.com/michelangelo-pyml/ Michelangelo PyML: Introducing Uber’s Platform for Rapid Python ML Model Development
+
*[https://eng.uber.com/michelangelo-pyml/ Michelangelo PyML: Introducing Uber’s Platform for Rapid Python ML Model Development]
*[https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600 Manage your Data Science project structure in early stage.] - blog post
+
*[https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600 Manage your Data Science project structure in early stage] - blog post
 
*[https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e Cookiecutter Data Science — Organize your Projects — Atom and Jupyter] - blog post
 
*[https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e Cookiecutter Data Science — Organize your Projects — Atom and Jupyter] - blog post
  +
*[https://github.com/SurrealAI/surreal surreal (GitHub)] - code
  +
*[https://github.com/SurrealAI/cloudwise cloudwise (GitHub)] - code
  +
*[https://github.com/SurrealAI/caraml caraml (GitHub)] - code
  +
*[https://github.com/SurrealAI/symphony symphony (GitHub)] - code
  +
*[https://www.analyticsindiamag.com/tensorflow-vs-spark-differ-work-tandem TensorFlow Vs. Spark: How Do They Differ And Work In Tandem With Each Other] - blog post
  +
*[https://github.com/bulutyazilim/awesome-datascience awesome-datascience (GitHub)]
  +
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
  +
*[https://www.logicalclocks.com/blog/when-deep-learning-with-gpus-use-a-cluster-manager When Deep Learning with GPUs, use a Cluster Manager] - blog post
  +
  +
===Data Annotation & Labelling===
  +
*[https://appen.com/blog/data-annotation/ What is Data Annotation?]
  +
*[https://www.mturk.com Amazon Mechanical Turk]
  +
*[https://www.cloudfactory.com/ CloudFactory]
  +
*[https://appen.com/ Appen]
  +
*[https://www.alegion.com/ Alegion]
  +
*[https://imerit.net/ iMerit]
  +
*[https://playment.io/ Playment]
  +
*[https://www.rev.com/ Rev] - Transcription from video and audio
  +
*[https://labelbox.com/ Labelbox]
  +
*[https://github.com/diffgram/diffgram diffgram]
  +
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
  +
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
  +
*[https://github.com/taivop/awesome-data-annotation awesome-data-annotation (GitHub)]
  +
  +
===EDA===
  +
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - GitHub
  +
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
  +
*[https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python - PyCon 2017 (YouTube)]
  +
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
  +
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
  +
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
  +
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
  +
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
  +
  +
===Asynchronous Communication & Microservices===
  +
*[https://microservices.io/patterns/microservices.html Pattern: Microservice Architecture]
  +
*[https://www.dineshonjava.com/software-architecture-patterns-and-designs/ Software Architecture Patterns and Designs]
  +
*[https://codeblog.dotsandbrackets.com/asynchronous-communication-with-message-queue/ Asynchronous communication with message queue]
  +
*[https://garba.org/article/general/soa/mep.html Message Exchange Patterns (MEPs)]
  +
*[https://flylib.com/books/en/2.365.1/message_exchange_patterns.html Message exchange patterns]
  +
*[https://docs.microsoft.com/en-us/azure/architecture/patterns/category/messaging Messaging patterns]
  +
*[https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2 Synchronous vs Asynchronous communication in microservices integration]
  +
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
  +
*[https://dzone.com/articles/akka-streams-and-kafka-streams-where-microservices Akka Streams and Kafka Streams: Where Microservices Meet Fast Data]
  +
*[https://dzone.com/articles/akka-spark-or-kafka-selecting-the-right-streaming Akka, Spark, or Kafka? Selecting the Right Streaming Engine]
  +
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
  +
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
  +
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
  +
*[https://tanzu.vmware.com/content/blog/understanding-when-to-use-rabbitmq-or-apache-kafka Understanding When to use RabbitMQ or Apache Kafka]
  +
*[https://www.ververica.com/what-is-stream-processing What is Stream Processing?]
  +
*[https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97 A Gentle Introduction to Stream Processing]
  +
  +
===Distributed Systems===
  +
*[https://blog.docker.com/2016/10/docker-distributed-system-summit-videos-podcast-episodes/ Docker Distributed System Summit videos podcast episodes]
  +
*[https://www.voltdb.com/files/using-docker-simplify-distributed-systems-development/ Using Docker to Simplify Distributed Systems in Development] - video
  +
*[https://medium.com/@harinilabs/day-11-getting-started-with-docker-and-using-it-to-build-deploy-a-distributed-app-1929669064b8 Day 11: Using Docker to build and deploy a distributed app] - blog post with [https://github.com/harinij/100DaysOfCode/tree/master/Day%20011%20-%20Docker%20WebApp code]
  +
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
  +
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
  +
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
  +
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
  +
*[http://muratbuffalo.blogspot.com/2016/04/petuum-new-platform-for-distributed.html Paper Review. Petuum: A new platform for distributed machine learning on big data] - blog post
  +
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
  +
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
  +
*[https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
   
 
===Deployment and Production===
 
===Deployment and Production===
Line 169: Line 278:
 
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
 
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
 
*[https://github.com/practicalAI/productionML productionML (GitHub)] - code for creating Production level API services for Machine Learning
 
*[https://github.com/practicalAI/productionML productionML (GitHub)] - code for creating Production level API services for Machine Learning
  +
*[https://medium.com/kredaro-engineering/ai-tales-building-machine-learning-pipeline-using-kubeflow-and-minio-4b88da30437b AI Tales: Building Machine learning pipeline using Kubeflow and Minio] - blog post
  +
*[https://github.com/ahkarami/Deep-Learning-in-Production Deep-Learning-in-Production (GitHub)]
  +
*[https://medium.com/dataswati-garage/create-a-robust-ai-rest-api-71a8050ce314 Deploy your AI model the hard (and robust) way] - blog post

Latest revision as of 21:35, 24 November 2020

This page contains resources about Data Science, Data Engineering and Data Management.

Subfields and Concepts

  • Agile Data Science
  • Machine Learning / Data Mining
  • Exploratory Data Analysis (EDA)
  • Data Preparation and Data Preprocessing
  • Data Fusion and Data Integration
  • Data Wrangling / Data Munging
  • Data Scraping
  • Data Sampling
  • Data Cleaning
  • Data Visualization
  • Explainable AI (XAI) / Interpretable AI
  • Big Data
  • Data Engineering, Data Management and Databases
  • High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  • Concurrent/Multi-threading Computing for Machine Learning
  • Synchronous Communication (for Web Services)
    • Representational State Transfer (REST) Protocol
    • Remote Procedure Call (RPC)
    • Simple Object Access Protocol (SOAP)
  • Asynchronous Communication / Asynchronous Messaging (for Web Services)
    • Message broker/Message bus/Event bus/Integration broker/Interface engine
    • Message queue
    • Asynchronous protocols
      • Advanced Message Queuing Protocol (AMQP)
      • MQ Telemetry Transport (MQTT)
  • Messaging patterns
    • Fire-and-Forget / One-Way
    • Request-Response / Request-Reply
    • Publisher-Subscriber
    • Request-Callback
  • Software Architecture
    • Monolithic Architecture
    • Microservices Architecture
    • Service-Oriented Architecture (SOA)
  • Stream Processing

Online courses

Video Lectures

Lecture Notes

Books

  • Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. 2nd Ed. O'Reilly Media.
  • Bellemare, A. (2020). Building Event-Driven Microservices: Leveraging Organizational Data at Scale. O'Reilly Media.
  • Richards, M., & Ford, N. (2020). Fundamentals of Software Architecture. O'Reilly Media.
  • Dean, A., & Crettaz, V. (2019). Event Streams in Action. Manning.
  • Richardson, C. (2018). Microservices Patterns. Manning Publications.
  • Pacheco, V. F. (2018). Microservice Patterns and Best Practices. Packt Publishing.
  • De la Torre, C., Wagner, B., & Rousos, M. (2018). .NET Microservices: Architecture for Containerized .NET Applications. Microsoft Corporation.
  • Lanaro, G. (2017). Python High Performance. Packt Publishing.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
  • Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
  • Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing.
  • Dunning, T., & Friedman, E. (2016). Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media.
  • Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
  • Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
  • Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
  • Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing.
  • Kale, V. (2015). Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications. CRC Press.
  • Ejsmont, A. (2015). Web Scalability for Startup Engineers. McGraw Hill.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press.
  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical Data Science with R. Manning.
  • Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media.
  • Videla, A., & Williams, J. J. W. (2012). RabbitMQ in Action. Manning.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Scholarly Articles

  • Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
  • Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
  • Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
  • Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
  • Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
  • Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
  • Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
  • Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software

See also

Other Resources

General

Data Annotation & Labelling

EDA

Asynchronous Communication & Microservices

Distributed Systems

Deployment and Production