Difference between revisions of "Data Science"

From Ioannis Kourouklides
 
(188 intermediate revisions by 7 users not shown)
Line 1: Line 1:
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], including '''Data Engineering'''.
+
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
   
 
== Subfields and Concepts ==
 
== Subfields and Concepts ==
  +
* Agile Data Science
 
* [[Machine Learning]] / Data Mining
 
* [[Machine Learning]] / Data Mining
* Exploratory Data Analysis
+
* Exploratory Data Analysis (EDA)
* Data Preparation and Preprocessing
+
* Data Preparation and Data Preprocessing
  +
* Data Fusion and Data Integration
* Parallel/Distributed/Concurrent Computing for Machine Learning
 
* Data Engineering and Databases
+
* Data Wrangling / Data Munging
  +
* Data Scraping
  +
* Data Sampling
  +
* Data Cleaning
 
* Data Visualization
 
* Data Visualization
  +
* Explainable AI (XAI) / Interpretable AI
 
* Big Data
 
* Big Data
  +
* Data Engineering, Data Management and Databases
  +
* High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  +
* Concurrent/Multi-threading Computing for Machine Learning
  +
* Synchronous Communication (for Web Services)
  +
** Representational State Transfer (REST) Protocol
  +
** Remote Procedure Call (RPC)
  +
** Simple Object Access Protocol (SOAP)
  +
* Asynchronous Communication / Asynchronous Messaging (for Web Services)
  +
** Message broker/Message bus/Event bus/Integration broker/Interface engine
  +
** Message queue
  +
** Asynchronous protocols
  +
*** Advanced Message Queuing Protocol (AMQP)
  +
*** MQ Telemetry Transport (MQTT)
  +
* Messaging patterns
  +
** Fire-and-Forget / One-Way
  +
** Request-Response / Request-Reply
  +
** Publisher-Subscriber
  +
** Request-Callback
  +
* Software Architecture
  +
** Monolithic Architecture
  +
** Microservices Architecture
  +
** Service-Oriented Architecture (SOA)
  +
* Stream Processing
   
 
== Online courses ==
 
== Online courses ==
   
 
=== Video Lectures ===
 
=== Video Lectures ===
* [https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python by Chloe Mawer and Jonathan Whitmore] - PyCon 2017
+
* [https://www.coursera.org/learn/competitive-data-science How to Win a Data Science Competition: Learn from Top Kagglers] - Coursera
 
   
 
=== Lecture Notes ===
 
=== Lecture Notes ===
* [https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides]
+
* [https://goo.gl/VSTGUQ Data Science by Ioannis Kourouklides]
 
* [https://www.csie.ntu.edu.tw/~cjlin/talks/bigdata-bilbao.pdf When <nowiki> [to use] </nowiki> and When Not to Use Distributed Machine Learning by Chih-Jen Lin]
 
* [https://www.csie.ntu.edu.tw/~cjlin/talks/bigdata-bilbao.pdf When <nowiki> [to use] </nowiki> and When Not to Use Distributed Machine Learning by Chih-Jen Lin]
 
* [https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Open Machine Learning Course] (Medium)
 
* [https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Open Machine Learning Course] (Medium)
 
* [http://www.mmds.org/#mooc Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman]
 
* [http://www.mmds.org/#mooc Mining Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman]
  +
* [https://www.systems.ethz.ch/courses/fall2017/hadp Hardware Acceleration for Data Processing by Gustavo Alonso]
  +
* [http://cs109.github.io/2015/ CS109: Data Science]
   
 
==Books==
 
==Books==
  +
* Newman, S. (2021). ''Building Microservices: Designing Fine-Grained Systems''. 2nd Ed. O'Reilly Media.
* Tukey, J. W. (1977). ''Exploratory data analysis''. Addison-Wesley.
 
* Schutt, R., & O'Neil, C. (2013). ''Doing data science: Straight talk from the frontline''. O'Reilly Media.
+
* Bellemare, A. (2020). ''Building Event-Driven Microservices: Leveraging Organizational Data at Scale''. O'Reilly Media.
  +
* Richards, M., & Ford, N. (2020). ''Fundamentals of Software Architecture''. O'Reilly Media.
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press. ([http://www.mmds.org/ link])
 
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical data science with R''. Manning.
+
* Dean, A., & Crettaz, V. (2019). ''Event Streams in Action''. Manning.
  +
* Richardson, C. (2018). ''Microservices Patterns''. Manning Publications.
  +
* Pacheco, V. F. (2018). ''Microservice Patterns and Best Practices''. Packt Publishing.
  +
* De la Torre, C., Wagner, B., & Rousos, M. (2018). ''.NET Microservices: Architecture for Containerized .NET Applications''. Microsoft Corporation. ([https://github.com/dzfweb/microsoft-microservices-book link])
  +
* Lanaro, G. (2017). ''Python High Performance''. Packt Publishing.
  +
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
  +
* Kleppmann, M. (2017). ''Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems''. O'Reilly Media.
  +
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
  +
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing.
  +
* Dunning, T., & Friedman, E. (2016). ''Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.'' O'Reilly Media.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing Ltd.
+
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing.
  +
* Kale, V. (2015). ''Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications''. CRC Press.
* Blum, A., Hopcroft, J., & Kannan, R. (2015). Foundations of Data Science.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
+
* Ejsmont, A. (2015). ''Web Scalability for Startup Engineers''. McGraw Hill.
  +
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). ''Mining of Massive Datasets''. Cambridge University Press. ([http://www.mmds.org/ link])
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
 
  +
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical Data Science with R''. Manning.
  +
* Schutt, R., & O'Neil, C. (2013). ''Doing Data Science: Straight Talk from the Frontline''. O'Reilly Media.
  +
* Videla, A., & Williams, J. J. W. (2012). ''RabbitMQ in Action''. Manning.
  +
* Tukey, J. W. (1977). ''Exploratory Data Analysis''. Addison-Wesley.
  +
  +
==Scholarly Articles==
  +
* Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ''ACM Computing Surveys (CSUR), 53''(2), 1-33.
  +
* Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. ''arXiv preprint arXiv:1902.00465.''
  +
* Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. ''Proceedings of the VLDB Endowment, 10''(11), 1586-1597.
  +
* Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In ''Proceedings of the 22nd [https://en.wikipedia.org/wiki/SIGKDD ACM SIGKDD International Conference on Knowledge Discovery and Data Mining]'' (pp. 1135-1144).
  +
* Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. ''Engineering, 2''(2), 179-195.
  +
* Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. ''International Journal of Data Science and Analytics, 1''(3-4), 145-164.
  +
* Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In ''[https://en.wikipedia.org/wiki/Conference_on_Neural_Information_Processing_Systems Advances in Neural Information Processing Systems]'' (pp. 2503-2511).
  +
* Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In ''Proceedings of the 2015 [https://en.wikipedia.org/wiki/SIGMOD ACM SIGMOD International Conference on Management of Data]'' (pp. 607-618).
  +
* Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. ''arXiv preprint arXiv:1511.06051''.
  +
* Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. ''Journal of Parallel and Distributed Computing, 73''(3), 284-292.
  +
* Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. ''IEEE Communications Surveys & Tutorials, 13''(3), 311-336.
  +
* Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. ''IEEE Data Eng. Bull., 32''(1), 3-12.
   
 
==Software==
 
==Software==
  +
* [https://www.docker.com/ Docker] (Containers)
  +
* [https://www.anaconda.com/ Anaconda Distribution] - Python
  +
* [https://cython.org/ Cython] - Python
 
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup 4] - Python
 
* [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beautiful Soup 4] - Python
  +
* [https://lxml.de/ lxml] - Python
  +
* [https://selenium-python.readthedocs.io/ Selenium] - Python
  +
* [https://doc.scrapy.org/en/latest/index.html Scrapy] - Python
 
* [https://github.com/ray-project/ray ray] - Python
 
* [https://github.com/ray-project/ray ray] - Python
  +
* [https://docs.python.org/3.4/library/multiprocessing.html multiprocessing] - Python
* [https://www.elastic.co/products/elasticsearch Elasticsearch]
 
  +
* [https://docs.python.org/3.4/library/threading.html threading] - Python
  +
* [https://github.com/ClimbsRocks/auto_ml auto_ml] - Python
  +
* [https://docs.celeryproject.org/en/stable/getting-started/introduction.html Celery] - Python
  +
* [https://www.elastic.co/products/elasticsearch Elasticsearch], [https://www.elastic.co/products/logstash Logstash], [https://www.elastic.co/products/kibana Kibana] (ELK)
 
* [https://www.mongodb.com/ MongoDB]
 
* [https://www.mongodb.com/ MongoDB]
 
* [http://lucene.apache.org/solr/ Apache Solr]
 
* [http://lucene.apache.org/solr/ Apache Solr]
Line 45: Line 111:
 
* [https://spark.apache.org/ Apache Spark]
 
* [https://spark.apache.org/ Apache Spark]
 
* [https://hive.apache.org/ Apache Hive]
 
* [https://hive.apache.org/ Apache Hive]
* [http://kafka.apache.org/ Apache Kafka], which includes [https://www.confluent.io/product/connectors/ Kafka Connect]
 
 
* [http://cassandra.apache.org/ Apache Cassandra]
 
* [http://cassandra.apache.org/ Apache Cassandra]
 
* [https://zookeeper.apache.org/ Apache ZooKeeper]
 
* [https://zookeeper.apache.org/ Apache ZooKeeper]
Line 52: Line 117:
 
* [http://couchdb.apache.org/ Apache CouchDB]
 
* [http://couchdb.apache.org/ Apache CouchDB]
 
* [http://activemq.apache.org/ Apache ActiveMQ]
 
* [http://activemq.apache.org/ Apache ActiveMQ]
* [https://www.rabbitmq.com/ RabbitMQ]
+
* [http://samza.apache.org/ Apache Samza]
  +
* [https://flink.apache.org/ Apache Flink]
  +
* [http://kafka.apache.org/ Apache Kafka] (which includes [https://www.confluent.io/product/connectors/ Kafka Connect]) - A message broker
  +
* [https://www.rabbitmq.com/ RabbitMQ] - A message broker
  +
* [https://redis.io/ Redis] - A message broker
  +
* [https://spark.apache.org/docs/latest/api/python/index.html pyspark] - Spark Python API
 
* [http://platanios.org/tensorflow_scala/ tensorflow_scala] - Scala API for TensorFlow
 
* [http://platanios.org/tensorflow_scala/ tensorflow_scala] - Scala API for TensorFlow
  +
* [https://github.com/migueldeicaza/TensorFlowSharp TensorFlowSharp] - TensorFlow API for .NET languages
  +
* [https://github.com/yahoo/TensorFlowOnSpark TensorFlowOnSpark] - It brings TensorFlow programs onto Apache Spark clusters
  +
* [https://numba.pydata.org/ Numba] - Python
  +
* [https://graphql.org/ GraphQL]
  +
* [https://www.nginx.com/ nginx]
  +
* [https://dvc.org/ DVC] - Data Version Control
  +
* [https://www.kubeflow.org/ kubeflow]
  +
* [https://akka.io/ Akka]
  +
* [https://www.pykka.org/ Pykka]
  +
* [https://apache.github.io/incubator-heron/ Heron]
  +
* [https://airflow.apache.org/ Apache Airflow] - Workflow Management System
  +
* [http://druid.io/ Druid]
  +
* [https://superset.incubator.apache.org/druid.html Apache Superset]
  +
* [https://github.com/horovod/horovod Horovod] - TensorFlow, Keras, PyTorch, and MXNet
  +
* [https://www.acumos.org/ Acumos AI]
  +
* [https://hopsworks.readthedocs.io/en/0.9/hopsml/hopsML.html HopsML]
  +
* [https://arrow.apache.org/ Apache Arrow]
   
 
==See also==
 
==See also==
Line 59: Line 146:
   
 
==Other Resources==
 
==Other Resources==
  +
===General===
  +
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides] - slides
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
  +
*[https://www.oreilly.com/ideas/a-manifesto-for-agile-data-science A manifesto for Agile data science] - blog post
  +
*[https://towardsdatascience.com/data-science-project-flow-for-startups-282a93d4508d Data Science Project Flow for Startups] - blog post
 
*[http://www.cse.ust.hk/~kxmo/LargeML.html Large Scale Machine Learning] - libraries and papers
 
*[http://www.cse.ust.hk/~kxmo/LargeML.html Large Scale Machine Learning] - libraries and papers
 
*[https://www.quora.com/What-are-some-courses-on-large-scale-learning What are some courses on large scale learning?] - Quora
 
*[https://www.quora.com/What-are-some-courses-on-large-scale-learning What are some courses on large scale learning?] - Quora
 
*[https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html 7 Steps to Mastering Data Preparation with Python] - blog post
 
*[https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html 7 Steps to Mastering Data Preparation with Python] - blog post
 
*[https://www.kdnuggets.com/2017/12/baesens-web-scraping-data-science-python.html Web Scraping for Data Science with Python] - blog post
 
*[https://www.kdnuggets.com/2017/12/baesens-web-scraping-data-science-python.html Web Scraping for Data Science with Python] - blog post
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
 
 
*[http://vlad17.github.io/COS513-Blog/ Princeton Commodities Modeling Blog]
 
*[http://vlad17.github.io/COS513-Blog/ Princeton Commodities Modeling Blog]
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - Github
 
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
 
 
*[https://github.com/upalr/Python-camp Python-camp] - Github
 
*[https://github.com/upalr/Python-camp Python-camp] - Github
 
*[http://mtitek.com/big-data.php Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ...] - installation instructions
 
*[http://mtitek.com/big-data.php Big Data: Spark, Hadoop, Hive, ZooKeeper, Solr, Kafka, Nutch, MongoDB, ...] - installation instructions
 
*[https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Deep Learning with Apache Spark and TensorFlow] - blog post
 
*[https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Deep Learning with Apache Spark and TensorFlow] - blog post
 
*[https://khartig.wordpress.com/2017/12/30/build-a-simple-chatbot-with-tensorflow-python-and-mongodb/ Build a Simple Chatbot with Tensorflow, Python and MongoDB] - blog post
 
*[https://khartig.wordpress.com/2017/12/30/build-a-simple-chatbot-with-tensorflow-python-and-mongodb/ Build a Simple Chatbot with Tensorflow, Python and MongoDB] - blog post
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
 
 
*[https://plot.ly/python/maps/ Plotly Python Library Maps]
 
*[https://plot.ly/python/maps/ Plotly Python Library Maps]
 
*[https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f 5 Quick and Easy Data Visualizations in Python with Code] - blog post
 
*[https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f 5 Quick and Easy Data Visualizations in Python with Code] - blog post
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
 
*[https://medium.com/@williamkoehrsen William Koehrsen] - blog
  +
*[http://www.claoudml.co/ ClaoudML] - Free Data Science & Machine Learning Resources
  +
*[https://www.datasciencecentral.com/profiles/blogs/data-science-in-python-pandas-cheat-sheet Data Science in Python: Pandas Cheat Sheet]
  +
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
  +
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
  +
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
  +
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
  +
*[https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463 Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data] - blog post
  +
*[https://eng.uber.com/peloton/ Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads] - blog post
  +
*[http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf Rules of Machine Learning: Best Practices for ML Engineering] - blog post
  +
*[https://blog.kovalevskyi.com/google-compute-engine-now-has-images-with-pytorch-1-0-0-and-fastai-1-0-2-57c49efd74bb Google Compute Engine Now Has Images With PyTorch 1.0.0 and FastAi 1.0.2] - blog post
  +
*[https://eng.uber.com/michelangelo-pyml/ Michelangelo PyML: Introducing Uber’s Platform for Rapid Python ML Model Development]
  +
*[https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600 Manage your Data Science project structure in early stage] - blog post
  +
*[https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e Cookiecutter Data Science — Organize your Projects — Atom and Jupyter] - blog post
  +
*[https://github.com/SurrealAI/surreal surreal (GitHub)] - code
  +
*[https://github.com/SurrealAI/cloudwise cloudwise (GitHub)] - code
  +
*[https://github.com/SurrealAI/caraml caraml (GitHub)] - code
  +
*[https://github.com/SurrealAI/symphony symphony (GitHub)] - code
  +
*[https://www.analyticsindiamag.com/tensorflow-vs-spark-differ-work-tandem TensorFlow Vs. Spark: How Do They Differ And Work In Tandem With Each Other] - blog post
  +
*[https://github.com/bulutyazilim/awesome-datascience awesome-datascience (GitHub)]
  +
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
  +
*[https://www.logicalclocks.com/blog/when-deep-learning-with-gpus-use-a-cluster-manager When Deep Learning with GPUs, use a Cluster Manager] - blog post
  +
  +
===Data Annotation & Labelling===
  +
*[https://appen.com/blog/data-annotation/ What is Data Annotation?]
  +
*[https://www.mturk.com Amazon Mechanical Turk]
  +
*[https://www.cloudfactory.com/ CloudFactory]
  +
*[https://appen.com/ Appen]
  +
*[https://www.alegion.com/ Alegion]
  +
*[https://imerit.net/ iMerit]
  +
*[https://playment.io/ Playment]
  +
*[https://www.rev.com/ Rev] - Transcription from video and audio
  +
*[https://labelbox.com/ Labelbox]
  +
*[https://github.com/diffgram/diffgram diffgram]
  +
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
  +
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
  +
*[https://github.com/taivop/awesome-data-annotation awesome-data-annotation (GitHub)]
  +
  +
===EDA===
  +
*[https://github.com/ajaymache/data-analysis-using-python Exploratory data analysis using Python for used car database taken from Kaggle] - GitHub
  +
*[https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python Detailed exploratory data analysis with Python] - Kaggle
  +
*[https://www.youtube.com/watch?v=W5WE9Db2RLU Exploratory data analysis in Python - PyCon 2017 (YouTube)]
  +
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-1-exploratory-data-analysis-with-pandas-de57880f1a68 Exploratory Data Analysis with Pandas] - blog post
  +
*[https://www.kaggle.com/randylaosat/simple-exploratory-data-analysis-passnyc Simple Exploratory Data Analysis - PASSNYC] - Kaggle
  +
*[https://www.kaggle.com/moizzz/eda-and-clustering EDA and Clustering] - Kaggle
  +
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
  +
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
  +
  +
===Asynchronous Communication & Microservices===
  +
*[https://microservices.io/patterns/microservices.html Pattern: Microservice Architecture]
  +
*[https://www.dineshonjava.com/software-architecture-patterns-and-designs/ Software Architecture Patterns and Designs]
  +
*[https://codeblog.dotsandbrackets.com/asynchronous-communication-with-message-queue/ Asynchronous communication with message queue]
  +
*[https://garba.org/article/general/soa/mep.html Message Exchange Patterns (MEPs)]
  +
*[https://flylib.com/books/en/2.365.1/message_exchange_patterns.html Message exchange patterns]
  +
*[https://docs.microsoft.com/en-us/azure/architecture/patterns/category/messaging Messaging patterns]
  +
*[https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2 Synchronous vs Asynchronous communication in microservices integration]
  +
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
  +
*[https://dzone.com/articles/akka-streams-and-kafka-streams-where-microservices Akka Streams and Kafka Streams: Where Microservices Meet Fast Data]
  +
*[https://dzone.com/articles/akka-spark-or-kafka-selecting-the-right-streaming Akka, Spark, or Kafka? Selecting the Right Streaming Engine]
  +
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
  +
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
  +
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
  +
*[https://tanzu.vmware.com/content/blog/understanding-when-to-use-rabbitmq-or-apache-kafka Understanding When to use RabbitMQ or Apache Kafka]
  +
*[https://www.ververica.com/what-is-stream-processing What is Stream Processing?]
  +
*[https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97 A Gentle Introduction to Stream Processing]
  +
  +
===Distributed Systems===
  +
*[https://blog.docker.com/2016/10/docker-distributed-system-summit-videos-podcast-episodes/ Docker Distributed System Summit videos podcast episodes]
  +
*[https://www.voltdb.com/files/using-docker-simplify-distributed-systems-development/ Using Docker to Simplify Distributed Systems in Development] - video
  +
*[https://medium.com/@harinilabs/day-11-getting-started-with-docker-and-using-it-to-build-deploy-a-distributed-app-1929669064b8 Day 11: Using Docker to build and deploy a distributed app] - blog post with [https://github.com/harinij/100DaysOfCode/tree/master/Day%20011%20-%20Docker%20WebApp code]
  +
*[https://medium.com/@Petuum/intro-to-distributed-deep-learning-systems-a2e45c6b8e7 Intro to Distributed Deep Learning Systems] - blog post
  +
*[https://www.systems.ethz.ch/sites/default/files/parallel-distributed-deep-learning.pdf Parallel and Distributed Deep Learning by Tal Ben-Nun]
  +
*[https://sebastianraschka.com/Articles/2014_multiprocessing.html An introduction to parallel programming using Python's multiprocessing module] - blog post
  +
*[https://www.kdnuggets.com/2015/12/spark-deep-learning-training-with-sparknet.html Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet] - blog post
  +
*[http://muratbuffalo.blogspot.com/2016/04/petuum-new-platform-for-distributed.html Paper Review. Petuum: A new platform for distributed machine learning on big data] - blog post
  +
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
  +
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
  +
*[https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
  +
  +
===Deployment and Production===
  +
*[https://towardsdatascience.com/how-docker-can-help-you-become-a-more-effective-data-scientist-7fc048ef91d5 How Docker Can Help You Become A More Effective Data Scientist] - blog post
  +
*[https://www.confluent.io/blog/build-deploy-scalable-machine-learning-production-apache-kafka/ How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka] - blog post
  +
*[https://towardsdatascience.com/deploying-deep-learning-models-part-1-an-overview-77b4d01dd6f7 Deploying deep learning models: Part 1 an overview] - blog post
  +
*[https://medium.com/@maheshkkumar/a-guide-to-deploying-machine-deep-learning-model-s-in-production-e497fd4b734a A guide to deploying Machine/Deep Learning model(s) in Production] - blog post
  +
*[https://medium.com/redbus-in/how-to-deploy-scikit-learn-ml-models-d390b4b8ce7a How redBus uses Scikit-Learn ML models to classify customer complaints?] - blog post
  +
*[https://willk.online/deploying-a-keras-deep-learning-model-as-a-web-application-in-p/ Deploying a Keras Deep Learning Model as a Web Application in Python] - blog post
  +
*[https://awesome-docker.netlify.com/ Awesome-docker] - A curated list of Docker resources and projects
  +
*[https://ramitsurana.github.io/awesome-kubernetes/ Awesome-Kubernetes] - A curated list of awesome Kubernetes resources
  +
*[https://www.youtube.com/watch?v=zxcvyrhmjbc Michael Herman - Going Serverless with OpenFaaS, Kubernetes, and Python - PyCon 2018 (YouTube)]
  +
*[https://www.youtube.com/watch?v=jbb1dbFaovg Aly Sivji, Joe Jasinski, tathagata dasgupta (t) - Docker for Data Science - PyCon 2018 (YouTube)]
  +
*[https://www.youtube.com/watch?v=kx-048qE-TI Ruben Orduz, Nolan Brubaker - A Python-flavored Introduction to Containers And Kubernetes (YouTube)] - PyCon 2018
  +
*[https://www.youtube.com/watch?v=nrzLdMWTRMM Miguel Grinberg - Microservices with Python and Flask - PyCon 2017 (YouTube)]
  +
*[https://www.youtube.com/watch?v=EuzoEaE6Cqs Deploy and scale containers with Docker native, open source orchestration - PyCon 2017 (YouTube)]
  +
*[https://www.youtube.com/watch?v=tdIIJuPh3SI Miguel Grinberg - Flask at Scale - PyCon 2016 (YouTube)]
  +
*[https://www.youtube.com/watch?v=GpHMTR7P2Ms Deploying and scaling applications with Docker, Swarm, and a tiny bit of Python magic - PyCon 2016 (YouTube)]
  +
*[https://www.youtube.com/watch?v=ZVaRK10HBjo Jérôme Petazzoni - Introduction to Docker and containers - PyCon 2016 (YouTube)]
  +
*[https://www.youtube.com/watch?v=DIcpEg77gdE Miguel Grinberg - Flask Workshop - PyCon 2015 (YouTube)]
  +
*[https://www.youtube.com/watch?v=YiZkHUbE6N0 Andrew T. Baker - Docker 101: Introduction to Docker - PyCon 2015 (YouTube)]
  +
*[https://www.youtube.com/watch?v=FGrIyBDQLPg Miguel Grinberg: Flask by Example - PyCon 2014 (YouTube)]
  +
*[https://towardsdatascience.com/learn-to-build-machine-learning-services-prototype-real-applications-and-deploy-your-work-to-aa97b2b09e0c Learn to Build Machine Learning Services, Prototype Real Applications, and Deploy your Work to Users] - blog post
  +
*[https://towardsdatascience.com/deploying-keras-deep-learning-models-with-flask-5da4181436a2 Deploying Keras Deep Learning Models with Flask] - blog post
  +
*[https://www.twilio.com/engineering/2012/10/18/open-sourcing-flask-restful Introducing Flask-RESTful] - blog post
  +
*[https://towardsdatascience.com/develop-a-nlp-model-in-python-deploy-it-with-flask-step-by-step-744f3bdd7776 Develop a NLP Model in Python & Deploy It with Flask, Step by Step] - blog post
  +
*[https://www.youtube.com/watch?v=knAFR4u73Es Deploying Machine Learning apps with Docker containers - MUPy 2017] - video
  +
*[https://medium.com/@patrickmichelberger/getting-started-with-anaconda-docker-b50a2c482139 Getting started with Anaconda & Docker] - blog post
  +
*[https://towardsdatascience.com/docker-for-data-science-9c0ce73e8263 Docker for Data Science] - blog post
  +
  +
*[https://becominghuman.ai/docker-for-data-science-part-1-dd41e5ef1d80 Simplified Docker-ing for Data Science — Part 1] - blog post
  +
*[https://www.born2data.com/2017/deeplearning_install-part4.html Deep Learning Installation Tutorial - Part 4: How to install Docker for Deep Learning] - blog post
  +
*[https://towardsdatascience.com/how-to-write-a-production-level-code-in-data-science-5d87bd75ced How to write a production-level code in Data Science?] - blog post
  +
*[https://www.elastic.co/webinars/event-logs-in-elasticsearch-and-machine-learning Web Access Logs in Elasticsearch and Machine Learning] - webinar
*[https://www.youtube.com/watch?v=f3I0izerPvc Deploying Python models to production] - video
*[https://www.youtube.com/watch?v=-UYyyeYJAoQ How to deploy machine learning models into production] - video
*[https://blog.cambridgespark.com/putting-machine-learning-models-into-production-d768560907bd Putting Machine Learning Models into Production] - blog post
*[https://github.com/practicalAI/productionML productionML (GitHub)] - code for creating Production level API services for Machine Learning
*[https://medium.com/kredaro-engineering/ai-tales-building-machine-learning-pipeline-using-kubeflow-and-minio-4b88da30437b AI Tales: Building Machine learning pipeline using Kubeflow and Minio] - blog post
*[https://github.com/ahkarami/Deep-Learning-in-Production Deep-Learning-in-Production (GitHub)]
*[https://medium.com/dataswati-garage/create-a-robust-ai-rest-api-71a8050ce314 Deploy your AI model the hard (and robust) way] - blog post

Latest revision as of 21:35, 24 November 2020

This page contains resources about Data Science, Data Engineering and Data Management.

Subfields and Concepts

  • Agile Data Science
  • Machine Learning / Data Mining
  • Exploratory Data Analysis (EDA)
  • Data Preparation and Data Preprocessing
  • Data Fusion and Data Integration
  • Data Wrangling / Data Munging
  • Data Scraping
  • Data Sampling
  • Data Cleaning
  • Data Visualization
  • Explainable AI (XAI) / Interpretable AI
  • Big Data
  • Data Engineering, Data Management and Databases
  • High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  • Concurrent/Multi-threading Computing for Machine Learning
  • Synchronous Communication (for Web Services)
    • Representational State Transfer (REST) Protocol
    • Remote Procedure Call (RPC)
    • Simple Object Access Protocol (SOAP)
  • Asynchronous Communication / Asynchronous Messaging (for Web Services)
    • Message broker/Message bus/Event bus/Integration broker/Interface engine
    • Message queue
    • Asynchronous protocols
      • Advanced Message Queuing Protocol (AMQP)
      • MQ Telemetry Transport (MQTT)
  • Messaging patterns
    • Fire-and-Forget / One-Way
    • Request-Response / Request-Reply
    • Publisher-Subscriber
    • Request-Callback
  • Software Architecture
    • Monolithic Architecture
    • Microservices Architecture
    • Service-Oriented Architecture (SOA)
  • Stream Processing
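
As a rough illustration of the messaging patterns listed above, the following is a minimal in-memory sketch in Python. It is not tied to any real broker or protocol (AMQP, MQTT, etc.); the class `InMemoryBroker`, the function `request_response` and the topic names are hypothetical and exist only for this example. Publishing is treated as fire-and-forget with publisher-subscriber fan-out, while request-response is shown as a plain blocking call.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryBroker:
    """Toy broker (hypothetical): routes each message to all subscribers of its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        # Publisher-subscriber: any number of handlers may register per topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        # Fire-and-forget: the publisher gets no reply, only a delivery count.
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


def request_response(handler: Callable[[str], str], request: str) -> str:
    # Request-response: the caller waits for the handler's reply.
    return handler(request)


received: List[str] = []
broker = InMemoryBroker()
broker.subscribe("predictions", received.append)
broker.subscribe("predictions", lambda m: received.append(m.upper()))

delivered = broker.publish("predictions", "model-ready")  # two subscribers
unrouted = broker.publish("metrics", "ignored")           # no subscribers
reply = request_response(lambda req: f"echo:{req}", "ping")
```

A real message broker would additionally decouple publisher and subscriber in time (queues, acknowledgements, persistence); this sketch delivers synchronously within one process to keep the pattern itself visible.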

Online courses

Video Lectures

Lecture Notes

Books

  • Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. 2nd Ed. O'Reilly Media.
  • Bellemare, A. (2020). Building Event-Driven Microservices: Leveraging Organizational Data at Scale. O'Reilly Media.
  • Richards, M., & Ford, N. (2020). Fundamentals of Software Architecture: An Engineering Approach. O'Reilly Media.
  • Dean, A., & Crettaz, V. (2019). Event Streams in Action. Manning.
  • Richardson, C. (2018). Microservices Patterns. Manning Publications.
  • Pacheco, V. F. (2018). Microservice Patterns and Best Practices. Packt Publishing.
  • De la Torre, C., Wagner, B., & Rousos, M. (2018). .NET Microservices: Architecture for Containerized .NET Applications. Microsoft Corporation.
  • Lanaro, G. (2017). Python High Performance. Packt Publishing.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
  • Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
  • Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing.
  • Dunning, T., & Friedman, E. (2016). Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media.
  • Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
  • Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
  • Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
  • Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing.
  • Kale, V. (2015). Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications. CRC Press.
  • Ejsmont, A. (2015). Web Scalability for Startup Engineers. McGraw Hill.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press.
  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical Data Science with R. Manning.
  • Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media.
  • Videla, A., & Williams, J. J. W. (2012). RabbitMQ in Action. Manning.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Scholarly Articles

  • Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
  • Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
  • Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
  • Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
  • Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
  • Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
  • Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
  • Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software

See also

Other Resources

General

Data Annotation & Labelling

EDA

Asynchronous Communication & Microservices

Distributed Systems

Deployment and Production