Difference between revisions of "Data Science"

From Ioannis Kourouklides
Line 139: Line 139:
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
 
*[http://www.claoudml.co/ ClaoudML] - Free Data Science & Machine Learning Resources
 
*[http://www.claoudml.co/ ClaoudML] - Free Data Science & Machine Learning Resources
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
 
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
 
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
 
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
 
 
*[https://www.datasciencecentral.com/profiles/blogs/data-science-in-python-pandas-cheat-sheet Data Science in Python: Pandas Cheat Sheet]
 
*[https://www.datasciencecentral.com/profiles/blogs/data-science-in-python-pandas-cheat-sheet Data Science in Python: Pandas Cheat Sheet]
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
 
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
 
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
* [https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
 
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
Line 161: Line 156:
 
*[https://github.com/SurrealAI/caraml caraml (GitHub)] - code
 
*[https://github.com/SurrealAI/caraml caraml (GitHub)] - code
 
*[https://github.com/SurrealAI/symphony symphony (GitHub)] - code
 
*[https://github.com/SurrealAI/symphony symphony (GitHub)] - code
*[http://muratbuffalo.blogspot.com/2016/04/petuum-new-platform-for-distributed.html Paper Review. Petuum: A new platform for distributed machine learning on big data] - blog post
 
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
 
 
*[https://www.analyticsindiamag.com/tensorflow-vs-spark-differ-work-tandem TensorFlow Vs. Spark: How Do They Differ And Work In Tandem With Each Other] - blog post
 
*[https://www.analyticsindiamag.com/tensorflow-vs-spark-differ-work-tandem TensorFlow Vs. Spark: How Do They Differ And Work In Tandem With Each Other] - blog post
 
*[https://github.com/bulutyazilim/awesome-datascience awesome-datascience (GitHub)]
 
*[https://github.com/bulutyazilim/awesome-datascience awesome-datascience (GitHub)]
 
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
 
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
 
*[https://www.logicalclocks.com/deep-learners-use-a-cluster-manager-for-your-gpus/ When Deep Learning with GPUs, use a Cluster Manager] - blog post
 
*[https://www.logicalclocks.com/deep-learners-use-a-cluster-manager-for-your-gpus/ When Deep Learning with GPUs, use a Cluster Manager] - blog post
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
 
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
 
 
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
 
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
 
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
 
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
Line 184: Line 175:
 
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - Github
 
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - Github
 
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
 
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
* [https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python - PyCon 2017 (Youtube)]
+
*[https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python - PyCon 2017 (Youtube)]
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
 
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
 
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
 
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
 
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
 
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
 
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
  +
  +
=== Distributed Systems===
  +
*[https://blog.docker.com/2016/10/docker-distributed-system-summit-videos-podcast-episodes/ Docker Distributed System Summit videos podcast episodes]
  +
*[https://www.voltdb.com/files/using-docker-simplify-distributed-systems-development/ Using Docker to Simplify Distributed Systems in Development] - video
  +
*[https://medium.com/@harinilabs/day-11-getting-started-with-docker-and-using-it-to-build-deploy-a-distributed-app-1929669064b8 Day 11: Using Docker to build and deploy a distributed app] - blog post with [https://github.com/harinij/100DaysOfCode/tree/master/Day%20011%20-%20Docker%20WebApp code]
  +
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
  +
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
  +
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
  +
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
  +
*[http://muratbuffalo.blogspot.com/2016/04/petuum-new-platform-for-distributed.html Paper Review. Petuum: A new platform for distributed machine learning on big data] - blog post
  +
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
  +
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
  +
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
  +
*[https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
   
 
===Deployment and Production===
 
===Deployment and Production===
Line 226: Line 231:
 
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
 
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
 
*[https://github.com/practicalAI/productionML productionML (GitHub)] - code for creating Production level API services for Machine Learning
 
*[https://github.com/practicalAI/productionML productionML (GitHub)] - code for creating Production level API services for Machine Learning
*[https://blog.docker.com/2016/10/docker-distributed-system-summit-videos-podcast-episodes/ Docker Distributed System Summit videos podcast episodes]
 
*[https://www.voltdb.com/files/using-docker-simplify-distributed-systems-development/ Using Docker to Simplify Distributed Systems in Development] - video
 
*[https://medium.com/@harinilabs/day-11-getting-started-with-docker-and-using-it-to-build-deploy-a-distributed-app-1929669064b8 Day 11: Using Docker to build and deploy a distributed app] - blog post with [https://github.com/harinij/100DaysOfCode/tree/master/Day%20011%20-%20Docker%20WebApp code]
 
 
*[https://medium.com/kredaro-engineering/ai-tales-building-machine-learning-pipeline-using-kubeflow-and-minio-4b88da30437b AI Tales: Building Machine learning pipeline using Kubeflow and Minio] - blog post
 
*[https://medium.com/kredaro-engineering/ai-tales-building-machine-learning-pipeline-using-kubeflow-and-minio-4b88da30437b AI Tales: Building Machine learning pipeline using Kubeflow and Minio] - blog post
 
*[https://github.com/ahkarami/Deep-Learning-in-Production Deep-Learning-in-Production (GitHub)]
 
*[https://github.com/ahkarami/Deep-Learning-in-Production Deep-Learning-in-Production (GitHub)]

Revision as of 16:02, 24 November 2020

This page contains resources about Data Science, including Data Engineering and Data Management.

Subfields and Concepts

  • Agile Data Science
  • Machine Learning / Data Mining
  • Exploratory Data Analysis (EDA)
  • Data Preparation and Data Preprocessing
  • Data Fusion and Data Integration
  • Data Wrangling / Data Munging
  • Data Scraping
  • Data Sampling
  • Data Cleaning
  • High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  • Concurrent/Multi-threaded Computing for Machine Learning
  • Synchronous Communication
    • REST protocol
  • Asynchronous Communication / Asynchronous Messaging (for microservices)
    • Message broker/Message bus/Event bus/Integration broker/Interface engine
    • Message queue
  • Data Engineering, Data Management and Databases
  • Data Visualization
  • Big Data
  • Explainable AI (XAI) / Interpretable AI

Online courses

Video Lectures

Lecture Notes

Books

  • Lanaro, G. (2017). Python High Performance. Packt Publishing Ltd.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
  • Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing Ltd.
  • Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
  • Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
  • Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
  • Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing Ltd.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.
  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical data science with R. Manning.
  • Schutt, R., & O'Neil, C. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media.
  • Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Scholarly Articles

  • Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
  • Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
  • Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
  • Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
  • Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
  • Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
  • Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
  • Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software

See also

Other Resources

General

Data Annotation & Labelling

EDA

Distributed Systems

Deployment and Production