This wiki has no edits or logs made within the last 45 days, therefore it is marked as inactive. If you would like to prevent this wiki from being closed, please start showing signs of activity here. If there are no signs of this wiki being used within the next 15 days, this wiki may be closed per the Dormancy Policy. This wiki will then be eligible for adoption by another user. If not adopted and still inactive 135 days from now, this wiki will become eligible for deletion. Please be sure to familiarize yourself with Miraheze's Dormancy Policy. If you are a bureaucrat, you can go to Special:ManageWiki and uncheck "inactive" yourself. If you have any other questions or concerns, please don't hesitate to ask at Stewards' noticeboard.

Difference between revisions of "Data Science"

From Ioannis Kourouklides
Jump to navigation Jump to search
 
(39 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], including '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
+
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
   
 
== Subfields and Concepts ==
 
== Subfields and Concepts ==
Line 11: Line 11:
 
* Data Sampling
 
* Data Sampling
 
* Data Cleaning
 
* Data Cleaning
 
* Data Visualization
 
* Explainable AI (XAI) / Interpretable AI
 
* Big Data
 
* Data Engineering, Data Management and Databases
 
* High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
 
* High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
 
* Concurrent/Multi-threading Computing for Machine Learning
 
* Concurrent/Multi-threading Computing for Machine Learning
* Synchronous Communication (for web services)
+
* Synchronous Communication (for Web Services)
** Representational State Transfer (REST) protocol
+
** Representational State Transfer (REST) Protocol
  +
** Remote Procedure Call (RPC)
* Asynchronous Communication / Asynchronous Messaging (for web services)
 
  +
** Simple Object Access Protocol (SOAP)
 
* Asynchronous Communication / Asynchronous Messaging (for Web Services)
 
** Message broker/Message bus/Event bus/Integration broker/Interface engine
 
** Message broker/Message bus/Event bus/Integration broker/Interface engine
 
** Message queue
 
** Message queue
  +
** Asynchronous protocols
* Data Engineering, Data Management and Databases
 
  +
*** Advanced Message Queuing Protocol (AMQP)
* Data Visualization
 
  +
*** MQ Telemetry Transport (MQTT)
* Big Data
 
  +
* Messaging patterns
* Explainable AI (XAI) / Interpretable AI
 
  +
** Fire-and-Forget / One-Way
  +
** Request-Response / Request-Reply
  +
** Publisher-Subscriber
  +
** Request-Callback
  +
* Software Architecture
  +
** Monolithic Architecture
  +
** Microservices Architecture
  +
** Service-Oriented Architecture (SOA)
  +
* Stream Processing
   
 
== Online courses ==
 
== Online courses ==
Line 37: Line 52:
   
 
==Books==
 
==Books==
  +
* Newman, S. (2021). ''Building Microservices: Designing Fine-Grained Systems''. 2nd Ed. O'Reilly Media.
* Lanaro, G. (2017). ''Python High Performance''. Packt Publishing Ltd.
 
  +
* Bellemare, A. (2020). ''Building Event-Driven Microservices: Leveraging Organizational Data at Scale''. O'Reilly Media.
  +
* Richards, M. (2020). ''Fundamentals of Software Architecture''. O'Reilly Media.
  +
* Dean A., & Crettaz, V. (2019). ''Event Streams in Action''. Manning.
  +
* Richardson, C. (2018). ''Microservices Patterns''. Manning Publications.
  +
* Pacheco, V. F. (2018). ''Microservice Patterns and Best Practices''. Packt Publishing.
  +
* De la Torre C., Wagner, B., & Rousos, M. (2018). ''.NET Microservices: Architecture for Containerized .NET Applications''. Microsoft Corporation. ([https://github.com/dzfweb/microsoft-microservices-book link])
 
* Lanaro, G. (2017). ''Python High Performance''. Packt Publishing.
 
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
 
* Wickham, H., & Grolemund, G. (2017). ''R for Data Science''. O'Reilly Media.
  +
* Kleppmann, M. (2017). ''Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems''. O'Reilly Media.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing Ltd.
+
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing.
  +
* Dunning, T., & Friedman, E. (2016). ''Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.'' O'Reilly Media.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
 
* Grus, J. (2015). ''Data Science from Scratch: First Principles with Python''. O'Reilly Media.
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing Ltd.
+
* Madhavan, S. (2015). ''Mastering Python for Data Science''. Packt Publishing.
  +
* Kale, V. (2015). ''Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications''. CRC Press.
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press. ([http://www.mmds.org/ link])
 
  +
* Ejsmont, A. (2015). ''Web Scalability for Startup Engineers''. McGraw Hill.
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical data science with R''. Manning.
 
 
* Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). ''Mining of Massive Datasets''. Cambridge University Press. ([http://www.mmds.org/ link])
* Schutt, R., & O'Neil, C. (2013). ''Doing data science: Straight talk from the frontline''. O'Reilly Media.
 
* Tukey, J. W. (1977). ''Exploratory data analysis''. Addison-Wesley.
+
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical Data Science with R''. Manning.
 
* Schutt, R., & O'Neil, C. (2013). ''Doing Data Science: Straight Talk from the Frontline''. O'Reilly Media.
  +
* Videla, A., & J.W. Williams, J. (2012). ''RabbitMQ in Action''. Manning.
  +
* Tukey, J. W. (1977). ''Exploratory Data Analysis''. Addison-Wesley.
   
 
==Scholarly Articles==
 
==Scholarly Articles==
Line 120: Line 147:
 
==Other Resources==
 
==Other Resources==
 
===General===
 
===General===
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides]
+
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides] - slides
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
Line 141: Line 168:
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://www.elastic.co/webinars/time-series-anomaly-detection-optimizing-machine-learning-jobs-in-elasticsearch Time Series Anomaly Detection: Optimizing your Machine Learning Jobs in Elasticsearch] - webinar
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
 
*[https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Federated Learning: Collaborative Machine Learning without Centralized Training Data] - blog post
 
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
 
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
 
*[https://github.com/vsmolyakov/pyspark pyspark (GitHub)] - collection of resources
 
*[https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463 Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data] - blog post
 
*[https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463 Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data] - blog post
Line 157: Line 185:
 
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
 
*[https://github.com/siboehm/awesome-learn-datascience awesome-learn-datascience (GitHub)]
 
*[https://www.logicalclocks.com/blog/when-deep-learning-with-gpus-use-a-cluster-manager When Deep Learning with GPUs, use a Cluster Manager] - blog post
 
*[https://www.logicalclocks.com/blog/when-deep-learning-with-gpus-use-a-cluster-manager When Deep Learning with GPUs, use a Cluster Manager] - blog post
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
 
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
 
*[https://dzone.com/articles/akka-streams-and-kafka-streams-where-microservices Akka Streams and Kafka Streams: Where Microservices Meet Fast Data]
 
*[https://dzone.com/articles/akka-spark-or-kafka-selecting-the-right-streaming Akka, Spark, or Kafka? Selecting the Right Streaming Engine]
 
   
 
===Data Annotation & Labelling===
 
===Data Annotation & Labelling===
Line 168: Line 190:
 
*[https://www.mturk.com Amazon Mechanical Turk]
 
*[https://www.mturk.com Amazon Mechanical Turk]
 
*[https://www.cloudfactory.com/ CloudFactory]
 
*[https://www.cloudfactory.com/ CloudFactory]
*[https://www.rev.com/ Rev]
+
*[https://appen.com/ Appen]
  +
*[https://www.alegion.com/ Alegion]
  +
*[https://imerit.net/ iMerit]
  +
*[https://playment.io/ Playment]
  +
*[https://www.rev.com/ Rev] - Transcription from video and audio
  +
*[https://labelbox.com/ Labelbox]
  +
*[https://github.com/diffgram/diffgram diffgram]
 
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
 
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
 
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
 
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
Line 182: Line 210:
 
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
 
*[https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190 Introduction to Exploratory Data Analysis in Python] - blog post
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
 
*[https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd Visual Data Analysis with Python] - blog post
  +
  +
===Asynchronous Communication & Microservices===
  +
*[https://microservices.io/patterns/microservices.html Pattern: Microservice Architecture]
  +
*[https://www.dineshonjava.com/software-architecture-patterns-and-designs/ Software Architecture Patterns and Designs]
  +
*[https://codeblog.dotsandbrackets.com/asynchronous-communication-with-message-queue/ Asynchronous communication with message queue]
  +
*[https://garba.org/article/general/soa/mep.html Message Exchange Patterns (MEPs)]
  +
*[https://flylib.com/books/en/2.365.1/message_exchange_patterns.html Message exchange patterns]
  +
*[https://docs.microsoft.com/en-us/azure/architecture/patterns/category/messaging Messaging patterns]
  +
*[https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2 Synchronous vs Asynchronous communication in microservices integration]
 
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
 
*[https://dzone.com/articles/akka-streams-and-kafka-streams-where-microservices Akka Streams and Kafka Streams: Where Microservices Meet Fast Data]
 
*[https://dzone.com/articles/akka-spark-or-kafka-selecting-the-right-streaming Akka, Spark, or Kafka? Selecting the Right Streaming Engine]
 
*[https://otonomo.io/blog/luigi-airflow-pinball-and-chronos-comparing-workflow-management-systems/ Luigi, Airflow, Pinball, and Chronos: Comparing Workflow Management Systems]
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
  +
*[https://tanzu.vmware.com/content/blog/understanding-when-to-use-rabbitmq-or-apache-kafka Understanding When to use RabbitMQ or Apache Kafka]
  +
*[https://www.ververica.com/what-is-stream-processing What is Stream Processing?]
  +
*[https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97 A Gentle Introduction to Stream Processing]
   
 
=== Distributed Systems===
 
=== Distributed Systems===
Line 194: Line 240:
 
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
 
*[http://www.cheerml.com/comparison-distributed-ml-platform A comparison of distributed machine learning platform] - blog post
 
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
 
*[https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/ Distributed Filesystems for Deep Learning] - blog post
*[https://www.zurich.ibm.com/snapml/ Snap ML] - IBM
 
 
*[https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
 
*[https://github.com/tmulc18/Distributed-TensorFlow-Guide Distributed-TensorFlow-Guide (GitHub)] - Distributed TensorFlow basics and examples of training algorithms (with code)
   

Latest revision as of 21:35, 24 November 2020

This page contains resources about Data Science, Data Engineering and Data Management.

Subfields and Concepts[edit]

  • Agile Data Science
  • Machine Learning / Data Mining
  • Exploratory Data Analysis (EDA)
  • Data Preparation and Data Preprocessing
  • Data Fusion and Data Integration
  • Data Wrangling / Data Munging
  • Data Scraping
  • Data Sampling
  • Data Cleaning
  • Data Visualization
  • Explainable AI (XAI) / Interpretable AI
  • Big Data
  • Data Engineering, Data Management and Databases
  • High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  • Concurrent/Multi-threading Computing for Machine Learning
  • Synchronous Communication (for Web Services)
    • Representational State Transfer (REST) Protocol
    • Remote Procedure Call (RPC)
    • Simple Object Access Protocol (SOAP)
  • Asynchronous Communication / Asynchronous Messaging (for Web Services)
    • Message broker/Message bus/Event bus/Integration broker/Interface engine
    • Message queue
    • Asynchronous protocols
      • Advanced Message Queuing Protocol (AMQP)
      • MQ Telemetry Transport (MQTT)
  • Messaging patterns
    • Fire-and-Forget / One-Way
    • Request-Response / Request-Reply
    • Publisher-Subscriber
    • Request-Callback
  • Software Architecture
    • Monolithic Architecture
    • Microservices Architecture
    • Service-Oriented Architecture (SOA)
  • Stream Processing

Online courses[edit]

Video Lectures[edit]

Lecture Notes[edit]

Books[edit]

  • Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. 2nd Ed. O'Reilly Media.
  • Bellemare, A. (2020). Building Event-Driven Microservices: Leveraging Organizational Data at Scale. O'Reilly Media.
  • Richards, M. (2020). Fundamentals of Software Architecture. O'Reilly Media.
  • Dean A., & Crettaz, V. (2019). Event Streams in Action. Manning.
  • Richardson, C. (2018). Microservices Patterns. Manning Publications.
  • Pacheco, V. F. (2018). Microservice Patterns and Best Practices. Packt Publishing.
  • De la Torre C., Wagner, B., & Rousos, M. (2018). .NET Microservices: Architecture for Containerized .NET Applications. Microsoft Corporation. (link)
  • Lanaro, G. (2017). Python High Performance. Packt Publishing.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
  • Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
  • Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing.
  • Dunning, T., & Friedman, E. (2016). Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media.
  • Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
  • Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
  • Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
  • Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing.
  • Kale, V. (2015). Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications. CRC Press.
  • Ejsmont, A. (2015). Web Scalability for Startup Engineers. McGraw Hill.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (link)
  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical Data Science with R. Manning.
  • Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media.
  • Videla, A., & J.W. Williams, J. (2012). RabbitMQ in Action. Manning.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Scholarly Articles[edit]

  • Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
  • Buchlovsky, P. ... (2018). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
  • Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
  • Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
  • Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
  • Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
  • Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
  • Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software[edit]

See also[edit]

Other Resources[edit]

General[edit]

Data Annotation & Labelling[edit]

EDA[edit]

Asynchronous Communication & Microservices[edit]

Distributed Systems[edit]

Deployment and Production[edit]