
Difference between revisions of "Data Science"

From Ioannis Kourouklides
 
(151 intermediate revisions by 3 users not shown)
Line 1: Line 1:
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], including '''Data Engineering'''.
+
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
   
 
== Subfields and Concepts ==
 
  +
* Agile Data Science
 
* [[Machine Learning]] / Data Mining
 
* Exploratory Data Analysis
+
* Exploratory Data Analysis (EDA)
* Data Preparation and Preprocessing
+
* Data Preparation and Data Preprocessing
  +
* Data Fusion and Data Integration
* High Performance/Parallel/Distributed Computing for Machine Learning
 
  +
* Data Wrangling / Data Munging
* Concurrent/Multi-threading Computing for Machine Learning
 
* Data Engineering and Databases
+
* Data Scraping
  +
* Data Sampling
  +
* Data Cleaning
 
* Data Visualization
 
  +
* Explainable AI (XAI) / Interpretable AI
 
* Big Data
 
  +
* Data Engineering, Data Management and Databases
  +
* High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  +
* Concurrent/Multi-threading Computing for Machine Learning
  +
* Synchronous Communication (for Web Services)
  +
** Representational State Transfer (REST)
  +
** Remote Procedure Call (RPC)
  +
** Simple Object Access Protocol (SOAP)
  +
* Asynchronous Communication / Asynchronous Messaging (for Web Services)
  +
** Message broker/Message bus/Event bus/Integration broker/Interface engine
  +
** Message queue
  +
** Asynchronous protocols
  +
*** Advanced Message Queuing Protocol (AMQP)
  +
*** MQ Telemetry Transport (MQTT)
  +
* Messaging patterns
  +
** Fire-and-Forget / One-Way
  +
** Request-Response / Request-Reply
  +
** Publisher-Subscriber
  +
** Request-Callback
  +
* Software Architecture
  +
** Monolithic Architecture
  +
** Microservices Architecture
  +
** Service-Oriented Architecture (SOA)
  +
* Stream Processing
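
Three of the messaging patterns listed above (Fire-and-Forget, Request-Reply and Publisher-Subscriber) can be sketched in a few lines of Python. This is only a minimal in-process illustration that uses the standard library's <code>queue</code> and <code>threading</code> modules in place of a real message broker such as RabbitMQ or Kafka. The names <code>InMemoryBroker</code>, <code>fire_and_forget</code>, <code>request_reply</code>, <code>squaring_worker</code> and <code>run_demo</code> are hypothetical and do not come from any library.

```python
import queue
import threading
from collections import defaultdict


class InMemoryBroker:
    """Toy in-process stand-in for a message broker (hypothetical, illustration only)."""

    def __init__(self):
        # Maps a topic name to the list of subscriber inboxes.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic):
        """Register a subscriber and return the queue it reads from."""
        inbox = queue.Queue()
        self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        """Publisher-Subscriber: every subscriber of the topic gets the message."""
        for inbox in self._subscribers[topic]:
            inbox.put(message)


def fire_and_forget(message_queue, message):
    """Fire-and-Forget: enqueue the message and return without waiting for a result."""
    message_queue.put(message)


def request_reply(request_queue, payload, timeout=5.0):
    """Request-Reply: attach a private reply queue and block until the answer arrives."""
    reply_queue = queue.Queue(maxsize=1)
    request_queue.put((payload, reply_queue))
    return reply_queue.get(timeout=timeout)


def squaring_worker(request_queue):
    """Consumer that answers each request with the square of its payload."""
    while True:
        item = request_queue.get()
        if item is None:  # sentinel value: stop the worker
            break
        payload, reply_queue = item
        reply_queue.put(payload * payload)


def run_demo():
    # Fire-and-Forget: the sender never looks at a result.
    log_queue = queue.Queue()
    fire_and_forget(log_queue, "job started")

    # Request-Reply: a worker thread plays the role of the remote service.
    requests = queue.Queue()
    worker = threading.Thread(target=squaring_worker, args=(requests,))
    worker.start()
    answer = request_reply(requests, 7)
    requests.put(None)
    worker.join()

    # Publisher-Subscriber: two subscribers receive the same event.
    broker = InMemoryBroker()
    first = broker.subscribe("events")
    second = broker.subscribe("events")
    broker.publish("events", "model retrained")

    return log_queue.get_nowait(), answer, first.get_nowait(), second.get_nowait()


if __name__ == "__main__":
    print(run_demo())
```

In a real deployment the queues would live in a separate broker process and messages would cross the network, so delivery guarantees, serialization and retries (none of which are modelled here) become the main design concerns.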
   
 
== Online courses ==
 
Line 15: Line 42:
 
=== Video Lectures ===
 
 
* [https://www.coursera.org/learn/competitive-data-science How to Win a Data Science Competition: Learn from Top Kagglers] - Coursera
 
* [https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python by Chloe Mawer and Jonathan Whitmore] - PyCon 2017
 
   
 
=== Lecture Notes ===
 
* [https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides]
+
* [https://goo.gl/VSTGUQ Data Science by Ioannis Kourouklides]
 
* [https://www.csie.ntu.edu.tw/~cjlin/talks/bigdata-bilbao.pdf When <nowiki> [to use] </nowiki> and When Not to Use Distributed Machine Learning by Chih-Jen Lin]
 
 
* [https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Open Machine Learning Course] (Medium)
 
Line 26: Line 52:
   
 
==Books==
 
  +
* Newman, S. (2021). ''Building Microservices: Designing Fine-Grained Systems''. 2nd Ed. O'Reilly Media.
* Tukey, J. W. (1977). ''Exploratory data analysis''. Addison-Wesley.
 
* Schutt, R., & O'Neil, C. (2013). ''Doing data science: Straight talk from the frontline''. O'Reilly Media.
+
* Bellemare, A. (2020). ''Building Event-Driven Microservices: Leveraging Organizational Data at Scale''. O'Reilly Media.
  +
* Richards, M., & Ford, N. (2020). ''Fundamentals of Software Architecture''. O'Reilly Media.
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press. ([http://www.mmds.org/ link])
 
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical data science with R''. Manning.
+
* Dean, A., & Crettaz, V. (2019). ''Event Streams in Action''. Manning.
  +
* Richardson, C. (2018). ''Microservices Patterns''. Manning Publications.
  +
* Pacheco, V. F. (2018). ''Microservice Patterns and Best Practices''. Packt Publishing.
  +
* De la Torre, C., Wagner, B., & Rousos, M. (2018). ''.NET Microservices: Architecture for Containerized .NET Applications''. Microsoft Corporation. ([https://github.com/dzfweb/microsoft-microservices-book link])
  +
* Lanaro, G. (2017). ''Python High Performance''. Packt Publishing.
  +
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
  +
* Kleppmann, M. (2017). ''Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems''. O'Reilly Media.
  +
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
  +
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing.
  +
* Dunning, T., & Friedman, E. (2016). ''Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.'' O'Reilly Media.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
 
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing Ltd.
+
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing.
  +
* Kale, V. (2015). ''Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications''. CRC Press.
* Blum, A., Hopcroft, J., & Kannan, R. (2015). Foundations of Data Science.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
+
* Ejsmont, A. (2015). ''Web Scalability for Startup Engineers''. McGraw Hill.
  +
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). ''Mining of Massive Datasets''. Cambridge University Press. ([http://www.mmds.org/ link])
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
 
  +
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical Data Science with R''. Manning.
  +
* Schutt, R., & O'Neil, C. (2013). ''Doing Data Science: Straight Talk from the Frontline''. O'Reilly Media.
  +
* Videla, A., & Williams, J. J. W. (2012). ''RabbitMQ in Action''. Manning.
  +
* Tukey, J. W. (1977). ''Exploratory Data Analysis''. Addison-Wesley.
   
 
==Scholarly Articles==
 
  +
* Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ''ACM Computing Surveys (CSUR), 53''(2), 1-33.
  +
* Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. ''arXiv preprint arXiv:1902.00465.''
  +
* Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. ''Proceedings of the VLDB Endowment, 10''(11), 1586-1597.
  +
* Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In ''Proceedings of the 22nd [https://en.wikipedia.org/wiki/SIGKDD ACM SIGKDD International Conference on Knowledge Discovery and Data Mining]'' (pp. 1135-1144).
 
* Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. ''Engineering, 2''(2), 179-195.
 
 
* Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. ''International Journal of Data Science and Analytics, 1''(3-4), 145-164.
 
* Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In ''Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data'' (pp. 607-618). ACM.
+
* Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In ''[https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems Advances in Neural Information Processing Systems]'' (pp. 2503-2511).
  +
* Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In ''Proceedings of the 2015 [https://en.wikipedia.org/wiki/SIGMOD ACM SIGMOD International Conference on Management of Data]'' (pp. 607-618).
 
* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. ''arXiv preprint arXiv:1511.06051''.
 
  +
* Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. ''Journal of Parallel and Distributed Computing, 73''(3), 284-292.
  +
* Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. ''IEEE Communications Surveys & Tutorials, 13''(3), 311-336.
  +
* Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. ''IEEE Data Eng. Bull., 32''(1), 3-12.
   
 
==Software==
 
 
* [https://www.docker.com/ Docker] (Containers)
 
 
* [https://www.anaconda.com/ Anaconda Distribution] - Python
 
  +
* [https://cython.org/ Cython] - Python
 
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup 4] - Python
 
  +
* [https://lxml.de/ lxml] - Python
  +
* [https://selenium-python.readthedocs.io/ Selenium] - Python
  +
* [https://doc.scrapy.org/en/latest/index.html Scrapy] - Python
 
* [https://github.com/ray-project/ray ray] - Python
 
 
* [https://docs.python.org/3.4/library/multiprocessing.html multiprocessing] - Python
 
 
* [https://docs.python.org/3.4/library/threading.html threading] - Python
 
  +
* [https://github.com/ClimbsRocks/auto_ml auto_ml] - Python
  +
* [https://docs.celeryproject.org/en/stable/getting-started/introduction.html Celery] - Python
 
* [https://www.elastic.co/products/elasticsearch Elasticsearch], [https://www.elastic.co/products/logstash Logstash], [https://www.elastic.co/products/kibana Kibana] (ELK)
 
 
* [https://www.mongodb.com/ MongoDB]
 
Line 58: Line 111:
 
* [https://spark.apache.org/ Apache Spark]
 
 
* [https://hive.apache.org/ Apache Hive]
 
* [http://kafka.apache.org/ Apache Kafka], which includes [https://www.confluent.io/product/connectors/ Kafka Connect]
 
 
* [http://cassandra.apache.org/ Apache Cassandra]
 
 
* [https://zookeeper.apache.org/ Apache ZooKeeper]
 
Line 65: Line 117:
 
* [http://couchdb.apache.org/ Apache CouchDB]
 
 
* [http://activemq.apache.org/ Apache ActiveMQ]
 
* [https://www.rabbitmq.com/ RabbitMQ]
+
* [http://samza.apache.org/ Apache Samza]
  +
* [https://flink.apache.org/ Apache Flink]
  +
* [http://kafka.apache.org/ Apache Kafka] (which includes [https://www.confluent.io/product/connectors/ Kafka Connect]) - A message broker
  +
* [https://www.rabbitmq.com/ RabbitMQ] - A message broker
  +
* [https://redis.io/ Redis] - An in-memory data store that can also be used as a message broker
 
* [https://spark.apache.org/docs/latest/api/python/index.html pyspark] - Spark Python API
 
 
* [http://platanios.org/tensorflow_scala/ tensorflow_scala] - Scala API for TensorFlow
 
Line 71: Line 127:
 
* [https://github.com/yahoo/TensorFlowOnSpark TensorFlowOnSpark] - It brings TensorFlow programs onto Apache Spark clusters
 
 
* [https://numba.pydata.org/ Numba] - Python
 
  +
* [https://graphql.org/ GraphQL]
  +
* [https://www.nginx.com/ nginx]
  +
* [https://dvc.org/ DVC] - Data Version Control
  +
* [https://www.kubeflow.org/ kubeflow]
  +
* [https://akka.io/ Akka]
  +
* [https://www.pykka.org/ Pykka]
  +
* [https://apache.github.io/incubator-heron/ Heron]
  +
* [https://airflow.apache.org/ Apache Airflow] - Workflow Management System
  +
* [http://druid.io/ Druid]
  +
* [https://superset.incubator.apache.org/druid.html Apache Superset]
  +
* [https://github.com/horovod/horovod Horovod] - TensorFlow, Keras, PyTorch, and MXNet
  +
* [https://www.acumos.org/ Acumos AI]
  +
* [https://hopsworks.readthedocs.io/en/0.9/hopsml/hopsML.html HopsML]
  +
* [https://arrow.apache.org/ Apache Arrow]
   
 
==See also==
 
Line 76: Line 146:
   
 
==Other Resources==
 
  +
===General===
  +
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides] - slides
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
 
  +
*[https://www.oreilly.com/ideas/a-manifesto-for-agile-data-science A manifesto for Agile data science] - blog post
  +
*[https://towardsdatascience.com/data-science-project-flow-for-startups-282a93d4508d Data Science Project Flow for Startups] - blog post
 
*[http://www.cse.ust.hk/~kxmo/LargeML.html Large Scale Machine Learning] - libraries and papers
 
 
*[https://www.quora.com/What-are-some-courses-on-large-scale-learning What are some courses on large scale learning?] - Quora
 
 
*[https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html 7 Steps to Mastering Data Preparation with Python] - blog post
 
 
*[https://www.kdnuggets.com/2017/12/baesens-web-scraping-data-science-python.html Web Scraping for Data Science with Python] - blog post
 
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
 
 
*[http://vlad17.github.io/COS513-Blog/ Princeton Commodities Modeling Blog]
 
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - GitHub
 
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
 
 
*[https://github.com/upalr/Python-camp Python-camp] - GitHub
 
*[http://mtitek.com/big-data.php Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ...] - installation instructions
 
 
*[https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Deep Learning with Apache Spark and TensorFlow] - blog post
 
 
*[https://khartig.wordpress.com/2017/12/30/build-a-simple-chatbot-with-tensorflow-python-and-mongodb/ Build a Simple Chatbot with Tensorflow, Python and MongoDB] - blog post
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
 
 
*[https://plot.ly/python/maps/ Plotly Python Library Maps]
 
 
*[https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f 5 Quick and Easy Data Visualizations in Python with Code] - blog post
 
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
 
 
*[http://www.claoudml.co/ ClaoudML] - Free Data Science & Machine Learning Resources
 
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
 
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
 
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
 
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
 
 
*[https://www.datasciencecentral.com/profiles/blogs/data-science-in-python-pandas-cheat-sheet Data Science in Python: Pandas Cheat Sheet]
 
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
 
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
 
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://www.elastic.co/webinars/event-logs-in-elasticsearch-and-machine-learning Web Access Logs in Elasticsearch and Machine Learning] - webinar
 
*[https://www.youtube.com/watch?v=f3I0izerPvc Deploying Python models to production] - video
 
*[https://www.youtube.com/watch?v=knAFR4u73Es Deploying Machine Learning apps with Docker containers - MUPy 2017] - video
 
*[https://www.youtube.com/watch?v=-UYyyeYJAoQ How to deploy machine learning models into production] - video
 
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
 
  +
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
  +
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
  +
*[https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463 Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data] - blog post
  +
*[https://eng.uber.com/peloton/ Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads] - blog post
  +
*[http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf Rules of Machine Learning: Best Practices for ML Engineering] - blog post
  +
*[https://blog.kovalevskyi.com/google-compute-engine-now-has-images-with-pytorch-1-0-0-and-fastai-1-0-2-57c49efd74bb Google Compute Engine Now Has Images With PyTorch 1.0.0 and FastAi 1.0.2] - blog post
  +
*[https://eng.uber.com/michelangelo-pyml/ Michelangelo PyML: Introducing Uber’s Platform for Rapid Python ML Model Development]
  +
*[https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600 Manage your Data Science project structure in early stage] - blog post
  +
*[https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e Cookiecutter Data Science — Organize your Projects — Atom and Jupyter] - blog post
  +
*[https://github.com/SurrealAI/surreal surreal (GitHub)] - code
  +
*[https://github.com/SurrealAI/cloudwise cloudwise (GitHub)] - code
  +
*[https://github.com/SurrealAI/caraml caraml (GitHub)] - code
  +
*[https://github.com/SurrealAI/symphony symphony (GitHub)] - code
  +
*[https://www.analyticsindiamag.com/tensorflow-vs-spark-differ-work-tandem TensorFlow Vs. Spark: How Do They Differ And Work In Tandem With Each Other] - blog post
  +
*[https://github.com/bulutyazilim/awesome-datascience awesome-datascience (GitHub)]
  +
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
  +
*[https://www.logicalclocks.com/blog/when-deep-learning-with-gpus-use-a-cluster-manager When Deep Learning with GPUs, use a Cluster Manager] - blog post
  +
  +
===Data Annotation & Labelling===
  +
*[https://appen.com/blog/data-annotation/ What is Data Annotation?]
  +
*[https://www.mturk.com Amazon Mechanical Turk]
  +
*[https://www.cloudfactory.com/ CloudFactory]
  +
*[https://appen.com/ Appen]
  +
*[https://www.alegion.com/ Alegion]
  +
*[https://imerit.net/ iMerit]
  +
*[https://playment.io/ Playment]
  +
*[https://www.rev.com/ Rev] - Transcription from video and audio
  +
*[https://labelbox.com/ Labelbox]
  +
*[https://github.com/diffgram/diffgram diffgram]
  +
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
  +
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
  +
*[https://github.com/taivop/awesome-data-annotation awesome-data-annotation (GitHub)]
  +
  +
===EDA===
  +
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - GitHub
  +
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
  +
*[https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python - PyCon 2017 (YouTube)]
  +
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
  +
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
  +
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
  +
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
  +
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
  +
  +
===Asynchronous Communication & Microservices===
  +
*[https://microservices.io/patterns/microservices.html Pattern: Microservice Architecture]
  +
*[https://www.dineshonjava.com/software-architecture-patterns-and-designs/ Software Architecture Patterns and Designs]
  +
*[https://codeblog.dotsandbrackets.com/asynchronous-communication-with-message-queue/ Asynchronous communication with message queue]
  +
*[https://garba.org/article/general/soa/mep.html Message Exchange Patterns (MEPs)]
  +
*[https://flylib.com/books/en/2.365.1/message_exchange_patterns.html Message exchange patterns]
  +
*[https://docs.microsoft.com/en-us/azure/architecture/patterns/category/messaging Messaging patterns]
  +
*[https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2 Synchronous vs Asynchronous communication in microservices integration]
  +
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
  +
*[https://dzone.com/articles/akka-streams-and-kafka-streams-where-microservices Akka Streams and Kafka Streams: Where Microservices Meet Fast Data]
  +
*[https://dzone.com/articles/akka-spark-or-kafka-selecting-the-right-streaming Akka, Spark, or Kafka? Selecting the Right Streaming Engine]
  +
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
  +
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
  +
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
  +
*[https://tanzu.vmware.com/content/blog/understanding-when-to-use-rabbitmq-or-apache-kafka Understanding When to use RabbitMQ or Apache Kafka]
  +
*[https://www.ververica.com/what-is-stream-processing What is Stream Processing?]
  +
*[https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97 A Gentle Introduction to Stream Processing]
  +
  +
===Distributed Systems===
  +
*[https://blog.docker.com/2016/10/docker-distributed-system-summit-videos-podcast-episodes/ Docker Distributed System Summit videos podcast episodes]
  +
*[https://www.voltdb.com/files/using-docker-simplify-distributed-systems-development/ Using Docker to Simplify Distributed Systems in Development] - video
  +
*[https://medium.com/@harinilabs/day-11-getting-started-with-docker-and-using-it-to-build-deploy-a-distributed-app-1929669064b8 Day 11: Using Docker to build and deploy a distributed app] - blog post with [https://github.com/harinij/100DaysOfCode/tree/master/Day%20011%20-%20Docker%20WebApp code]
  +
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
  +
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
  +
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
  +
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
  +
*[http://muratbuffalo.blogspot.com/2016/04/petuum-new-platform-for-distributed.html Paper Review. Petuum: A new platform for distributed machine learning on big data] - blog post
  +
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
  +
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
  +
*[https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
  +
  +
===Deployment and Production===
  +
*[https://towardsdatascience.com/how-docker-can-help-you-become-a-more-effective-data-scientist-7fc048ef91d5 How Docker Can Help You Become A More Effective Data Scientist] - blog post
  +
*[https://www.confluent.io/blog/build-deploy-scalable-machine-learning-production-apache-kafka/ How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka] - blog post
  +
*[https://towardsdatascience.com/deploying-deep-learning-models-part-1-an-overview-77b4d01dd6f7 Deploying deep learning models: Part 1 an overview] - blog post
  +
*[https://medium.com/@maheshkkumar/a-guide-to-deploying-machine-deep-learning-model-s-in-production-e497fd4b734a A guide to deploying Machine/Deep Learning model(s) in Production] - blog post
  +
*[https://medium.com/redbus-in/how-to-deploy-scikit-learn-ml-models-d390b4b8ce7a How redBus uses Scikit-Learn ML models to classify customer complaints?] - blog post
  +
*[https://willk.online/deploying-a-keras-deep-learning-model-as-a-web-application-in-p/ Deploying a Keras Deep Learning Model as a Web Application in Python] - blog post
*[https://awesome-docker.netlify.com/ Awesome-docker] - A curated list of Docker resources and projects
*[https://ramitsurana.github.io/awesome-kubernetes/ Awesome-Kubernetes] - A curated list for awesome kubernetes sources
*[https://www.youtube.com/watch?v=zxcvyrhmjbc Michael Herman - Going Serverless with OpenFaaS, Kubernetes, and Python - PyCon 2018 (Youtube)]
*[https://www.youtube.com/watch?v=jbb1dbFaovg Aly Sivji, Joe Jasinski, tathagata dasgupta (t) - Docker for Data Science - PyCon 2018 (Youtube)]
*[https://www.youtube.com/watch?v=kx-048qE-TI Ruben Orduz, Nolan Brubaker - A Python-flavored Introduction to Containers And Kubernetes (Youtube)] - PyCon 2018
*[https://www.youtube.com/watch?v=nrzLdMWTRMM Miguel Grinberg - Microservices with Python and Flask - PyCon 2017 (Youtube)]
*[https://www.youtube.com/watch?v=EuzoEaE6Cqs Deploy and scale containers with Docker native, open source orchestration PyCon 2017 (Youtube)]
*[https://www.youtube.com/watch?v=tdIIJuPh3SI Miguel Grinberg - Flask at Scale - PyCon 2016 (Youtube)]
*[https://www.youtube.com/watch?v=GpHMTR7P2Ms Deploying and scaling applications with Docker, Swarm, and a tiny bit of Python magic - PyCon 2016 (Youtube)]
*[https://www.youtube.com/watch?v=ZVaRK10HBjo Jérôme Petazzoni - Introduction to Docker and containers - PyCon 2016 (Youtube)]
*[https://www.youtube.com/watch?v=DIcpEg77gdE Miguel Grinberg - Flask Workshop - PyCon 2015 (Youtube)]
*[https://www.youtube.com/watch?v=YiZkHUbE6N0 Andrew T. Baker - Docker 101: Introduction to Docker - PyCon 2015 (Youtube)]
*[https://www.youtube.com/watch?v=FGrIyBDQLPg Miguel Grinberg: Flask by Example - PyCon 2014 (Youtube)]
 
*[https://towardsdatascience.com/learn-to-build-machine-learning-services-prototype-real-applications-and-deploy-your-work-to-aa97b2b09e0c Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users] - blog post
 
*[https://towardsdatascience.com/deploying-keras-deep-learning-models-with-flask-5da4181436a2 Deploying Keras Deep Learning Models with Flask] - blog post
*[https://www.twilio.com/engineering/2012/10/18/open-sourcing-flask-restful Introducing Flask-RESTful] - blog post
*[https://towardsdatascience.com/develop-a-nlp-model-in-python-deploy-it-with-flask-step-by-step-744f3bdd7776 Develop a NLP Model in Python & Deploy It with Flask, Step by Step] - blog post
*[https://www.youtube.com/watch?v=knAFR4u73Es Deploying Machine Learning apps with Docker containers - MUPy 2017] - video
 
*[https://medium.com/@patrickmichelberger/getting-started-with-anaconda-docker-b50a2c482139 Getting started with Anaconda & Docker] - blog post
 
*[https://towardsdatascience.com/docker-for-data-science-9c0ce73e8263 Docker for Data Science] - blog post
 
*[https://becominghuman.ai/docker-for-data-science-part-1-dd41e5ef1d80 Simplified Docker-ing for Data Science — Part 1] - blog post
 
*[https://www.born2data.com/2017/deeplearning_install-part4.html Deep Learning Installation Tutorial - Part 4: How to install Docker for Deep Learning] - blog post
*[https://towardsdatascience.com/how-to-write-a-production-level-code-in-data-science-5d87bd75ced How to write a production-level code in Data Science?] - blog post
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
 
*[https://www.elastic.co/webinars/event-logs-in-elasticsearch-and-machine-learning Web Access Logs in Elasticsearch and Machine Learning] - webinar
*[https://www.youtube.com/watch?v=f3I0izerPvc Deploying Python models to production] - video
*[https://www.youtube.com/watch?v=-UYyyeYJAoQ How to deploy machine learning models into production] - video
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
*[https://github.com/practicalAI/productionML productionML (GitHub)] - code for creating Production level API services for Machine Learning
*[https://medium.com/kredaro-engineering/ai-tales-building-machine-learning-pipeline-using-kubeflow-and-minio-4b88da30437b AI Tales: Building Machine learning pipeline using Kubeflow and Minio] - blog post
*[https://github.com/ahkarami/Deep-Learning-in-Production Deep-Learning-in-Production (GitHub)]
*[https://medium.com/dataswati-garage/create-a-robust-ai-rest-api-71a8050ce314 Deploy your AI model the hard (and robust) way] - blog post

Latest revision as of 21:35, 24 November 2020

This page contains resources about Data Science, Data Engineering and Data Management.

Subfields and Concepts

  • Agile Data Science
  • Machine Learning / Data Mining
  • Exploratory Data Analysis (EDA)
  • Data Preparation and Data Preprocessing
  • Data Fusion and Data Integration
  • Data Wrangling / Data Munging
  • Data Scraping
  • Data Sampling
  • Data Cleaning
  • Data Visualization
  • Explainable AI (XAI) / Interpretable AI
  • Big Data
  • Data Engineering, Data Management and Databases
  • High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  • Concurrent/Multi-threading Computing for Machine Learning
  • Synchronous Communication (for Web Services)
    • Representational State Transfer (REST) Protocol
    • Remote Procedure Call (RPC)
    • Simple Object Access Protocol (SOAP)
  • Asynchronous Communication / Asynchronous Messaging (for Web Services)
    • Message broker/Message bus/Event bus/Integration broker/Interface engine
    • Message queue
    • Asynchronous protocols
      • Advanced Message Queuing Protocol (AMQP)
      • MQ Telemetry Transport (MQTT)
  • Messaging patterns
    • Fire-and-Forget / One-Way
    • Request-Response / Request-Reply
    • Publisher-Subscriber
    • Request-Callback
  • Software Architecture
    • Monolithic Architecture
    • Microservices Architecture
    • Service-Oriented Architecture (SOA)
  • Stream Processing

Online courses

Video Lectures

Lecture Notes

Books

  • Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. 2nd Ed. O'Reilly Media.
  • Bellemare, A. (2020). Building Event-Driven Microservices: Leveraging Organizational Data at Scale. O'Reilly Media.
  • Richards, M., & Ford, N. (2020). Fundamentals of Software Architecture. O'Reilly Media.
  • Dean, A., & Crettaz, V. (2019). Event Streams in Action. Manning.
  • Richardson, C. (2018). Microservices Patterns. Manning Publications.
  • Pacheco, V. F. (2018). Microservice Patterns and Best Practices. Packt Publishing.
  • De la Torre, C., Wagner, B., & Rousos, M. (2018). .NET Microservices: Architecture for Containerized .NET Applications. Microsoft Corporation. (link)
  • Lanaro, G. (2017). Python High Performance. Packt Publishing.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
  • Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
  • Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing.
  • Dunning, T., & Friedman, E. (2016). Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media.
  • Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
  • Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
  • Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
  • Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing.
  • Kale, V. (2015). Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications. CRC Press.
  • Ejsmont, A. (2015). Web Scalability for Startup Engineers. McGraw Hill.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (link)
  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical Data Science with R. Manning.
  • Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media.
  • Videla, A., & Williams, J. J. W. (2012). RabbitMQ in Action. Manning.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Scholarly Articles

  • Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
  • Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
  • Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
  • Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
  • Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
  • Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
  • Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
  • Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software

See also

Other Resources

General

Data Annotation & Labelling

EDA

Asynchronous Communication & Microservices

Distributed Systems

Deployment and Production