DSB #141

Hi,

DSB is here, so spend the last few days of January 2023 with this volume! Everybody knows about ChatGPT, and LLMs (large language models) and their variations are everywhere (uff). This time I would especially recommend, for practitioners, the article in the MLOps & MLReg section that focuses on optimizing deep learning models. But since this volume comes out after a longer break, all the articles are really good. So go through all of them!

And as always, enjoy your reading.

Analytical

https://www.aidancooper.co.uk/how-shapley-values-work/ – You are probably already using SHAP (or weighted SHAP, which was introduced in DSB #139), but maybe you don’t know how it works. So give it a try and learn what is under the hood of Shapley values.
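
To give a taste of what is under the hood, here is a minimal sketch of an exact Shapley value computation on a toy game. It is not taken from the article or from the SHAP library: the function names, the three "features" and the payout numbers are all made up for illustration. The idea is simply to average each player's marginal contribution over every possible ordering.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payout: 'a' adds 10, 'b' adds 20, together they add 6 more."""
    v = 0.0
    if "a" in coalition:
        v += 10
    if "b" in coalition:
        v += 20
    if "a" in coalition and "b" in coalition:
        v += 6  # interaction effect, shared between a and b
    return v


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

The interaction bonus is split evenly between "a" and "b", the unused "c" gets nothing, and the values add up to the payout of the full coalition. Enumerating all orderings only works for a handful of players, which is why practical tools rely on approximations.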

https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html – First part in a series where Google researchers will highlight the progress they made in 2022 and present their vision for 2023 and beyond.

https://towardsdatascience.com/demystifying-efficient-self-attention-b3de61b9b0fb – Comprehensive overview of the different types of efficient attention, with intuitive explanations.

Computer Science & Science

https://vadimkravcenko.com/shorts/things-they-didnt-teach-you/ – Code and models are secondary; business value comes first and matters most. This article describes the harsh reality of software engineering, and it definitely applies to data science as well.

https://stackoverflow.blog/2022/12/30/you-should-be-reading-academic-computer-science-papers/ – Why reading academic papers is important for growing your skills.

https://www.theregister.com/2022/12/21/ai_assistants_bad_code/ – Surprisingly, programmers who accept help from AI tools like GitHub Copilot or Facebook InCoder produce less secure code than those who don’t. There is also an important lawsuit against Microsoft, GitHub, and OpenAI for allegedly violating copyright law, which could change the rules of the game.

Graphs and Visualizations

https://www.nature.com/articles/s41467-020-19160-7.pdf – Guide for the scientific use of colour to make it accessible for people with colour-vision deficiencies.

https://make-a-video3d.github.io/ – MAV3D is a method by Meta AI for generating three-dimensional dynamic scenes from text.

https://milospopovic.net/6-ways-to-map-population-with-r.r/ – Map population density in R.

Business and Career

https://finance.yahoo.com/news/millennial-founder-sold-her-company-205034590.html – JP Morgan filed a lawsuit against Charlie Javice and Olivier Amar, claiming the pair fabricated around 4 million nonexistent accounts in their platform Frank Financial Aid, which JP Morgan purchased for $175 million. Beware of what you’re buying.

https://aeon.co/essays/innovation-is-overvalued-maintenance-often-matters-more – Why ordinary, boring maintenance is often much more important than innovation. “A professional innovation consultant advised his clients to ban the word at their companies. He said it was just a ‘word to hide the lack of substance’”.

https://futurism.com/deep-learning-expert-gpt-startups-rude-awakening – The extreme hype around ChatGPT could create a bubble full of empty promises, unrealistic expectations and ideas without a real basis.

Pop

https://www.nature.com/articles/d41586-023-00023-2 – The hype around AI is almost unbearable and tiresome, yet we still have great problems with the reproducibility and reliability of models, for example in health care. Hopefully this will slowly change as the topic gets the attention it deserves.

https://www.washingtonpost.com/technology/2023/01/27/chatgpt-google-meta/ – Why was ChatGPT such a success while Meta’s BlenderBot, released 3 months earlier, was not? BlenderBot was boring and too safe. ChatGPT is not more innovative than other solutions (at least according to Yann LeCun), it’s just more fun to use and very well engineered.

https://www.vice.com/en/article/88q3gk/chinese-students-invent-invisibility-cloak – Chinese students have invented a coat that makes people invisible to AI cameras. (rcmd by reader)

Education

https://kidger.site/thoughts/just-know-stuff/ – Over the years we have shared multiple lists of things that data scientists should know. This list is different: more detailed, with unusual tips from an Oxford PhD graduate.

https://www.coursera.org/specializations/mathematics-for-machine-learning-and-data-science – DeepLearning.AI prepared a Coursera specialization on mathematics for data science. It’s now available.

https://pythonspeed.com/articles/polars-memory-pandas/ – If you’re using Parquet with Pandas, try Polars instead. It can evaluate queries lazily, and thanks to that it’s faster and more memory efficient.

Datasets & Libraries

https://github.com/srush/Tensor-Puzzles – Tensor Puzzles is a collection of 16 tensor puzzles that help you to understand and demystify how tensors work.

https://github.com/karpathy/minGPT – Our favorite Andrej Karpathy prepared a PyTorch re-implementation of GPT, which he calls minGPT. You can also watch his educational video where he explains the implementation of nanoGPT.

https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit#gid=1158069878 – Clear and concise list of LLMs with metrics and information: who created them, when they were announced, whether they are public, links, repositories and more.

MLOps & MLReg

https://github.com/google-research/tuning_playbook – Do you want to maximize performance of your deep learning model? Then try this repo with emphasis on the process of hyperparameter tuning. It is created by people from Google Research and Harvard University.

https://building.nubank.com.br/automatic-retraining-for-machine-learning-models/ – ML models lose performance over time, so they need to be retrained after a while. Brazilian neobank Nubank shares ideas on how they do it.

https://mlops.community/optimizing-machine-learning-training-pipelines/ – Techniques to speed up training, improve the machine learning engineer experience, and keep costs under control.

Video & Podcast

https://youtu.be/iE5QLrzkGBU – Python is the go-to language for working with data, and perhaps once you’ve mastered Pandas you think that’s all you need. Well, the legendary James Powell may convince you to work on your Python skills a bit more. (rcmd by reader)

https://youtu.be/qs0D9sdbKPU – The Cartesian Cafe with Timothy Nguyen is an extremely interesting YouTube channel with interviews discussing different topics very deeply. This episode is about quantum computing with professor of computer science Scott Aaronson. He will explain to you why most of the things you read about quantum computing are nonsense.

https://google-research.github.io/seanet/musiclm/examples/ – Now we have not only LLMs, but also an MLM (music LM), and it’s fun to use. Try it in the article.

Papers & Books

https://arxiv.org/abs/2212.14034 – Train an LLM on a single GPU in one day! Code is available here.

https://arxiv.org/abs/2301.02828 – Why do k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs?
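
If kNN-LMs are new to you, the core idea can be sketched in a few lines: the next-token distribution of the parametric LM is mixed with a distribution built from the nearest neighbors retrieved from a datastore. The snippet below is a toy illustration of that commonly described interpolation, not the analysis from the paper. The tokens, distances, temperature and the mixing weight `lam` are made-up values.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (distance, next_token) pairs into a probability distribution.

    Closer neighbors get more weight via a softmax over negative distances.
    """
    weights = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_LM."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }


# Hypothetical parametric LM output and retrieved neighbors.
p_lm = {"cat": 0.5, "dog": 0.5}
neighbors = [(0.0, "cat"), (0.0, "cat"), (0.0, "dog")]

p_final = interpolate(p_lm, knn_distribution(neighbors))
print(p_final)
```

Here the LM alone is undecided between the two tokens, and the retrieved neighbors tip the balance toward "cat".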

Behind the Fence

https://jobs.smartrecruiters.com/PublicisGroupe/743999873316181-data-scientist-machine-learning-engineer – Data Scientist at Epsilon, Irving, USA.

Joke

https://www.monkeyuser.com/assets/images/2023/257-would-you-rather.png 😀 😀 (rcmd by reader)