
Difference between revisions of "Data Science"

From Ioannis Kourouklides
 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], including '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
+
This page contains resources about [https://en.wikipedia.org/wiki/Data_science Data Science], '''Data Engineering''' and [https://en.wikipedia.org/wiki/Data_management Data Management].
   
 
== Subfields and Concepts ==
 
== Subfields and Concepts ==
Line 29: Line 29:
 
* Messaging patterns
 
* Messaging patterns
 
** Fire-and-Forget / One-Way
 
** Fire-and-Forget / One-Way
** Request-Response
+
** Request-Response / Request-Reply
 
** Publisher-Subscriber
 
** Publisher-Subscriber
** Callback
+
** Request-Callback
 
* Software Architecture
 
* Software Architecture
 
** Monolithic Architecture
 
** Monolithic Architecture
 
** Microservices Architecture
 
** Microservices Architecture
 
** Service-Oriented Architecture (SOA)
 
** Service-Oriented Architecture (SOA)
  +
* Stream Processing
   
 
== Online courses ==
 
== Online courses ==
Line 52: Line 53:
 
==Books==
 
==Books==
 
* Newman, S. (2021). ''Building Microservices: Designing Fine-Grained Systems''. 2nd Ed. O'Reilly Media.
 
* Newman, S. (2021). ''Building Microservices: Designing Fine-Grained Systems''. 2nd Ed. O'Reilly Media.
  +
* Bellemare, A. (2020). ''Building Event-Driven Microservices: Leveraging Organizational Data at Scale''. O'Reilly Media.
  +
* Richards, M., & Ford, N. (2020). ''Fundamentals of Software Architecture''. O'Reilly Media.
  +
* Dean, A., & Crettaz, V. (2019). ''Event Streams in Action''. Manning.
 
* Richardson, C. (2018). ''Microservices Patterns''. Manning Publications.
 
* Richardson, C. (2018). ''Microservices Patterns''. Manning Publications.
 
* Pacheco, V. F. (2018). ''Microservice Patterns and Best Practices''. Packt Publishing.
 
* Pacheco, V. F. (2018). ''Microservice Patterns and Best Practices''. Packt Publishing.
Line 60: Line 64:
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
 
* VanderPlas, J. (2016). ''Python Data Science Handbook: Essential Tools for Working with Data''. O'Reilly Media.
 
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing.
 
* Pierfederici, F. (2016). ''Distributed Computing with Python''. Packt Publishing.
  +
* Dunning, T., & Friedman, E. (2016). ''Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.'' O'Reilly Media.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Nolan, D., & Lang, D. T. (2015). ''Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving''. CRC Press.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
 
* Elston, S. F. (2015). ''Data Science in the Cloud with Microsoft Azure Machine Learning and R.'' O'Reilly Media, Inc.
Line 69: Line 74:
 
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical Data Science with R''. Manning.
 
* Zumel, N., Mount, J., & Porzak, J. (2014). ''Practical Data Science with R''. Manning.
 
* Schutt, R., & O'Neil, C. (2013). ''Doing Data Science: Straight Talk from the Frontline''. O'Reilly Media.
 
* Schutt, R., & O'Neil, C. (2013). ''Doing Data Science: Straight Talk from the Frontline''. O'Reilly Media.
  +
* Videla, A., & Williams, J. J. W. (2012). ''RabbitMQ in Action''. Manning.
 
* Tukey, J. W. (1977). ''Exploratory Data Analysis''. Addison-Wesley.
 
* Tukey, J. W. (1977). ''Exploratory Data Analysis''. Addison-Wesley.
   
Line 141: Line 147:
 
==Other Resources==
 
==Other Resources==
 
===General===
 
===General===
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides]
+
*[https://www.slideshare.net/kourouklides/what-is-data-science-99294704/ What is Data Science by Ioannis Kourouklides] - slides
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[https://datascienceguide.github.io/ Data Science Guide]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
 
*[http://jadianes.me/data-science-your-way/ Data Science Engineering, your way]
Line 184: Line 190:
 
*[https://www.mturk.com Amazon Mechanical Turk]
 
*[https://www.mturk.com Amazon Mechanical Turk]
 
*[https://www.cloudfactory.com/ CloudFactory]
 
*[https://www.cloudfactory.com/ CloudFactory]
*[https://www.rev.com/ Rev]
+
*[https://appen.com/ Appen]
  +
*[https://www.alegion.com/ Alegion]
  +
*[https://imerit.net/ iMerit]
  +
*[https://playment.io/ Playment]
  +
*[https://www.rev.com/ Rev] - Transcription from video and audio
  +
*[https://labelbox.com/ Labelbox]
  +
*[https://github.com/diffgram/diffgram diffgram]
 
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
 
*[https://dl.acm.org/citation.cfm?id=1866696 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk]
 
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
 
*[https://www.cloudfactory.com/data-annotation-tool-guide Data Annotation Tools for Machine Learning (Evolving Guide)]
Line 205: Line 217:
 
*[https://garba.org/article/general/soa/mep.html Message Exchange Patterns (MEPs)]
 
*[https://garba.org/article/general/soa/mep.html Message Exchange Patterns (MEPs)]
 
*[https://flylib.com/books/en/2.365.1/message_exchange_patterns.html Message exchange patterns]
 
*[https://flylib.com/books/en/2.365.1/message_exchange_patterns.html Message exchange patterns]
  +
*[https://docs.microsoft.com/en-us/azure/architecture/patterns/category/messaging Messaging patterns]
 
*[https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2 Synchronous vs Asynchronous communication in microservices integration]
 
*[https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2 Synchronous vs Asynchronous communication in microservices integration]
 
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
 
*[https://otonomo.io/blog/redis-kafka-or-rabbitmq-which-microservices-message-broker-to-choose/ Redis, Kafka or RabbitMQ: Which MicroServices Message Broker To Choose?]
Line 212: Line 225:
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://github.com/kaiwaehner/kafka-streams-machine-learning-examples kafka-streams-machine-learning-examples (GitHub)] - Machine Learning + Kafka Streams Examples (with code)
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
 
*[https://aseigneurin.github.io/2018/09/05/realtime-machine-learning-predictions-wth-kafka-and-h2o.html Realtime Machine Learning predictions with Kafka and H2O.ai] - blog post
  +
*[https://tanzu.vmware.com/content/blog/understanding-when-to-use-rabbitmq-or-apache-kafka Understanding When to use RabbitMQ or Apache Kafka]
  +
*[https://www.ververica.com/what-is-stream-processing What is Stream Processing?]
  +
*[https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97 A Gentle Introduction to Stream Processing]
   
 
=== Distributed Systems===
 
=== Distributed Systems===

Latest revision as of 21:35, 24 November 2020

This page contains resources about Data Science, Data Engineering and Data Management.

Subfields and Concepts

  • Agile Data Science
  • Machine Learning / Data Mining
  • Exploratory Data Analysis (EDA)
  • Data Preparation and Data Preprocessing
  • Data Fusion and Data Integration
  • Data Wrangling / Data Munging
  • Data Scraping
  • Data Sampling
  • Data Cleaning
  • Data Visualization
  • Explainable AI (XAI) / Interpretable AI
  • Big Data
  • Data Engineering, Data Management and Databases
  • High Performance/Parallel/Distributed/Cloud Computing for Machine Learning
  • Concurrent/Multi-threading Computing for Machine Learning
  • Synchronous Communication (for Web Services)
    • Representational State Transfer (REST) Protocol
    • Remote Procedure Call (RPC)
    • Simple Object Access Protocol (SOAP)
  • Asynchronous Communication / Asynchronous Messaging (for Web Services)
    • Message broker/Message bus/Event bus/Integration broker/Interface engine
    • Message queue
    • Asynchronous protocols
      • Advanced Message Queuing Protocol (AMQP)
      • MQ Telemetry Transport (MQTT)
  • Messaging patterns
    • Fire-and-Forget / One-Way
    • Request-Response / Request-Reply
    • Publisher-Subscriber
    • Request-Callback
  • Software Architecture
    • Monolithic Architecture
    • Microservices Architecture
    • Service-Oriented Architecture (SOA)
  • Stream Processing
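
Three of the messaging patterns listed above (Fire-and-Forget, Request-Reply and Publisher-Subscriber) can be illustrated with a minimal in-process sketch. It uses only the Python standard library; the names `Broker`, `fire_and_forget`, `request_reply`, `echo_worker` and `demo` are hypothetical and chosen for illustration, and a production system would normally rely on a real message broker rather than on in-memory queues.

```python
import queue
import threading
from collections import defaultdict


class Broker:
    """Toy in-process stand-in for a message broker (hypothetical, not a real library)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Publisher-Subscriber: every handler registered on the topic receives the message.
        for handler in self._subscribers[topic]:
            handler(message)


def fire_and_forget(inbox, message):
    # Fire-and-Forget / One-Way: enqueue the message and return without waiting for a reply.
    inbox.put(message)


def request_reply(inbox, payload, timeout=1.0):
    # Request-Reply: send the request together with a private reply queue, then block on it.
    reply_to = queue.Queue(maxsize=1)
    inbox.put((payload, reply_to))
    return reply_to.get(timeout=timeout)


def echo_worker(inbox):
    # Answers each request with its upper-cased payload until the None sentinel arrives.
    while True:
        item = inbox.get()
        if item is None:
            break
        payload, reply_to = item
        reply_to.put(payload.upper())


def demo():
    # One-way: the sender never learns whether the message was processed.
    log = queue.Queue()
    fire_and_forget(log, "job-1")

    # Request-reply: a worker thread serves the request queue.
    requests = queue.Queue()
    worker = threading.Thread(target=echo_worker, args=(requests,), daemon=True)
    worker.start()
    reply = request_reply(requests, "ping")
    requests.put(None)  # ask the worker to stop
    worker.join()

    # Publish-subscribe: two independent subscribers on the same topic.
    broker = Broker()
    received = []
    broker.subscribe("events", received.append)
    broker.subscribe("events", lambda m: received.append(m + "!"))
    broker.publish("events", "created")

    return log.get_nowait(), reply, received
```

Calling `demo()` exercises the three patterns in turn. Request-Callback is not shown; under the same assumptions it could be sketched by passing a handler function with the request instead of blocking on a reply queue.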

Online courses

Video Lectures

Lecture Notes

Books

  • Newman, S. (2021). Building Microservices: Designing Fine-Grained Systems. 2nd Ed. O'Reilly Media.
  • Bellemare, A. (2020). Building Event-Driven Microservices: Leveraging Organizational Data at Scale. O'Reilly Media.
  • Richards, M., & Ford, N. (2020). Fundamentals of Software Architecture. O'Reilly Media.
  • Dean, A., & Crettaz, V. (2019). Event Streams in Action. Manning.
  • Richardson, C. (2018). Microservices Patterns. Manning Publications.
  • Pacheco, V. F. (2018). Microservice Patterns and Best Practices. Packt Publishing.
  • De la Torre, C., Wagner, B., & Rousos, M. (2018). .NET Microservices: Architecture for Containerized .NET Applications. Microsoft Corporation. (link)
  • Lanaro, G. (2017). Python High Performance. Packt Publishing.
  • Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
  • Kleppmann, M. (2017). Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
  • Pierfederici, F. (2016). Distributed Computing with Python. Packt Publishing.
  • Dunning, T., & Friedman, E. (2016). Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media.
  • Nolan, D., & Lang, D. T. (2015). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.
  • Elston, S. F. (2015). Data Science in the Cloud with Microsoft Azure Machine Learning and R. O'Reilly Media, Inc.
  • Grus, J. (2015). Data Science from Scratch: First Principles with Python. O'Reilly Media.
  • Madhavan, S. (2015). Mastering Python for Data Science. Packt Publishing.
  • Kale, V. (2015). Guide to Cloud Computing for Business and Technology Managers: From Distributed Computing to Cloudware Applications. CRC Press.
  • Ejsmont, A. (2015). Web Scalability for Startup Engineers. McGraw Hill.
  • Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. (link)
  • Zumel, N., Mount, J., & Porzak, J. (2014). Practical Data Science with R. Manning.
  • Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media.
  • Videla, A., & Williams, J. J. W. (2012). RabbitMQ in Action. Manning.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Scholarly Articles

  • Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., & Rellermeyer, J. S. (2020). A Survey on Distributed Machine Learning. ACM Computing Surveys (CSUR), 53(2), 1-33.
  • Buchlovsky, P. ... (2019). TF-Replicator: Distributed Machine Learning for Researchers. arXiv preprint arXiv:1902.00465.
  • Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017). NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11), 1586-1597.
  • Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144).
  • Xing, E. P., Ho, Q., Xie, P., & Wei, D. (2016). Strategies and principles of distributed machine learning on big data. Engineering, 2(2), 179-195.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., & Huang, J. Z. (2016). Big data analytics on Apache Spark. International Journal of Data Science and Analytics, 1(3-4), 145-164.
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., ... & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (pp. 2503-2511).
  • Huang, Y., Zhu, F., Yuan, M., Deng, K., Li, Y., Ni, B., ... & Zeng, J. (2015). Telco Churn Prediction with Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 607-618).
  • Moritz, P., Nishihara, R., Stoica, I., & Jordan, M. I. (2015). SparkNet: Training Deep Networks in Spark. arXiv preprint arXiv:1511.06051.
  • Upadhyaya, S. R. (2013). Parallel approaches to machine learning—A comprehensive survey. Journal of Parallel and Distributed Computing, 73(3), 284-292.
  • Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311-336.
  • Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Eng. Bull., 32(1), 3-12.

Software

See also

Other Resources

General

Data Annotation & Labelling

EDA

Asynchronous Communication & Microservices

Distributed Systems

Deployment and Production