PyData Bangalore Meetup #7
PyData (Bangalore) is a community of users and developers of all things data - Data Science, Data Engineering, Machine Learning, Deep Learning, Data Ethics, Visualization, etc. We gather to discuss how best to apply Python tools - as well as R and Julia - to meet the evolving challenges in our use cases. We get together every month for 4 talks and a few lightning talks to share ideas and learn from each other.
We organized the first meetup of 2020 (7th overall) on Saturday, 18th Jan. The wonderful people at AbInBev, Bangalore offered us their space to host this month's meetup. This write-up is a bit late, but here's what happened throughout the session.
With the audience gathered, Shashi from AbInBev opened the session at 10:30 AM. He invited the first speaker, Akash Swamy, to the stage. Akash is a Sr. Machine Learning Engineer working on building scalable solutions for various clients. Check out his GitHub for more details. The topic of Akash's 30-minute talk was Defensive Programming for Deep Learning with PyTorch. Slides are here. His talk was about building proper coding habits when you are putting ML/DL code in production.
- The first point he covered was Code Modularity. After doing your research on the model you're going to use, divide it into different modules. For example, instead of creating one script or a single class for everything related to an AutoEncoder, create one class for the Encoder, another for the Decoder, and a last one for the AutoEncoder. This way, if you later want to create a Seq2Seq model, you can reuse the Encoder and Decoder within a Seq2Seq class. The modularity means fewer bugs and more robust production code (see the sketch after this list).
- Next he mentioned Type Annotations, which were introduced in Python 3.5. Define your own annotations first, building them up from the basic annotation types. For example, the type annotation for a set of bounding boxes could be a list of tuples of floats. He also mentioned using mypy for static type checking.
- Another important part he mentioned was Error Handling. Instead of putting a lot of conditions in a single `assert` statement, break it into multiple `assert` statements. This way you get separate, better error handling along with a proper error message for each type of error, which helps while debugging.
- Last but not least, Unit Testing. He discussed how to go about unit testing your modules and why it's important to have such tests in Deep Learning.
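To make these points concrete, here's a minimal PyTorch sketch (my own illustration, not Akash's actual code) that combines the modularity, type-annotation, and multiple-assert ideas in a toy autoencoder:

```python
from typing import List, Tuple

import torch
import torch.nn as nn

# A custom annotation built from basic types: a batch of bounding
# boxes as a list of (x1, y1, x2, y2) float tuples.
BoundingBoxes = List[Tuple[float, float, float, float]]


class Encoder(nn.Module):
    """Maps inputs to a latent code; reusable inside a Seq2Seq model too."""

    def __init__(self, in_dim: int, latent_dim: int) -> None:
        super().__init__()
        self.net = nn.Linear(in_dim, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.net(x))


class Decoder(nn.Module):
    """Maps latent codes back to the input space."""

    def __init__(self, latent_dim: int, out_dim: int) -> None:
        super().__init__()
        self.net = nn.Linear(latent_dim, out_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class AutoEncoder(nn.Module):
    """Composes the two small modules instead of one monolithic class."""

    def __init__(self, in_dim: int, latent_dim: int) -> None:
        super().__init__()
        self.encoder = Encoder(in_dim, latent_dim)
        self.decoder = Decoder(latent_dim, in_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multiple small asserts instead of one big one, so each failure
        # produces its own specific, debuggable error message.
        assert x.dim() == 2, f"expected a 2-D batch, got {x.dim()}-D"
        assert x.size(1) == self.encoder.net.in_features, "wrong input width"
        return self.decoder(self.encoder(x))


model = AutoEncoder(in_dim=784, latent_dim=32)
print(model(torch.randn(8, 784)).shape)  # torch.Size([8, 784])
# Running `mypy` on this file would statically check the annotations.
```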
After the presentation, Akash answered a few questions from the audience. The next speaker, Tamojit Maiti, was then invited on stage. Tamojit is a Masters student interning at AbInBev Bangalore this semester. Check out his LinkedIn for more details. Tamojit spoke for about 30 minutes on State-of-the-Art Dimensionality Reduction Techniques. Slides and associated notebooks are here. Here is what he covered in his talk.
- What is dimensionality reduction and why do we need it.
- He discussed matrix factorization for dimensionality reduction; its pros and cons
- Next was Manifold Learning. What is it? Pros and cons. A unified approach to Manifold learning.
- Multi-Dimensional Scaling (MDS). How it works. What the parameters are. How to use the `scikit-learn` implementation of MDS. Its drawbacks.
- Locally Linear Embedding (LLE). How it works. What the parameters are. How to use the `scikit-learn` implementation of LLE. Its drawbacks.
- t-Distributed Stochastic Neighborhood Embedding (t-SNE). How it works. What the hyperparameters are. What its drawbacks are. (See the sketch after this list for all three methods.)
- How to select the best method? There's no universal metric that works for everything, as it's very hard to quantify information loss after dimensionality reduction. Different loss functions measure different things, and depending on the use case we decide what to pick.
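To tie the methods together, here's a small scikit-learn sketch (my own, not from Tamojit's notebooks) running MDS, LLE, and t-SNE on the digits dataset:

```python
# Comparing the three manifold methods from the talk on the scikit-learn
# digits dataset; parameter values here are just illustrative defaults.
from sklearn.datasets import load_digits
from sklearn.manifold import MDS, TSNE, LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 dimensions

# MDS tries to preserve pairwise distances in the low-dimensional space.
X_mds = MDS(n_components=2).fit_transform(X)

# LLE preserves local neighbourhood geometry; n_neighbors is its key knob.
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)

# t-SNE's key hyperparameter is perplexity (roughly, effective #neighbours).
X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X)

for name, emb in (("MDS", X_mds), ("LLE", X_lle), ("t-SNE", X_tsne)):
    print(name, emb.shape)  # each embedding is (1797, 2)
```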
After the Q&A session, we had a 15-minute break (which doubles as a networking session). During the break, Shashi, the venue host, took the attendees on a tour of the cool and funky AbInBev Bangalore office. After tea and coffee, everyone returned to their places and we started the second round of talks.
Our next speaker was Manu Joseph. Manu is a self-taught Data Scientist with 8+ years of professional experience, currently a researcher at Thoucentric Analytics. Check out more about him on LinkedIn. The topic of Manu's 45-minute talk was Interpretability: Cracking open the Black Box. Slides are here. He has also prepared a 3-part blog series, which you can find here, here and here. Following are the talk highlights.
- Why do we need explainable AI? Models interact with humans, and humans might not understand some decisions made by machines. Interpretation of an ML algorithm is needed, and this is especially difficult in DL/NN-based models.
- Transparent models: Linear Regression. Coefficients are interpretable. Feature importance.
- Transparent models: Decision Trees. Tree visualization using dtreeviz.
- Post-hoc Interpretation: Mean Decrease in Impurity. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret mean decrease in impurity.
- Post-hoc Interpretation: Drop Column Importance. What’s the algorithm. How to implement it. What are the advantages and disadvantages.
- Post-hoc Interpretation: Permutation Importance. What's the algorithm. How to implement it (see the sketch after this list). What are the advantages and disadvantages. How to interpret the permutation importance.
- Post-hoc Interpretation: Partial Dependence Plots. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret the partial dependence plots.
- Post-hoc Interpretation: Local Interpretable Model-agnostic Explanations (LIME). What is it. What’s the algorithm. How to implement it. What are the advantages and disadvantages. How to interpret it.
- Post-hoc Interpretation: Shapley Values and SHapley Additive exPlanations (SHAP). What they are and how the concept is borrowed from Game Theory. How it works. The mathematical guarantees of Shapley values. Different versions of SHAP. Implementation. And how to interpret the plots and values.
- The ending slide was about Ethics in AI. Manu exhorted everyone to think about ethics and proper interpretability while building their models. Our modeling decisions might really cause adverse effects for the consumers.
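As a taste of the post-hoc methods, here's a minimal permutation-importance sketch with scikit-learn (my own illustration; Manu's slides and blog series cover the full set of techniques):

```python
# Permutation importance: shuffle one feature at a time on held-out data
# and measure how much the model's score drops.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# A large mean score drop means the model relied heavily on that feature.
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]}: {result.importances_mean[idx]:.3f}")
```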
The next 45-minute talk, and the very last one, was by Ramya Ragupathy. She works with the Humanitarian OpenStreetMap Team. Find out more about her on Twitter. The title of her talk was OpenStreetMap Data Processing with Python. Slides are here. Following are the highlights.
- What is OpenStreetMap (OSM)? Basically, the Wikipedia of maps: an open-source, global, editable geodata source.
- Rapidly growing community
- Editable by anyone; how editing works.
- Who uses OSM?
- What is geodata? Basically, things that have a location: roads, buildings, census boundaries, temporal events (cloud cover, geo-located tweets).
- Storage formats of geodata: Raster. What is raster data? Data stored in pixels. What is a pixel? What info can it contain: color, height, slope, direction, etc. Commonly used in satellite imagery, weather data, etc. OSM provides raster data in the form of satellite imagery.
- Storage formats of geodata: Vector. What is vector data? It stores geometry, attribute, and location info. No pixelation occurs when zooming. Dynamically rendered. Geometric classes of vector data: point, line, polygon. Attributes associated with vectors. Each vector and its associated attributes are like a row in a table.
- Accessing OSM data. File formats: GeoJSON, Shapefile, KML, PBF, GeoPackage. Python modules: Shapely, GeoPandas, OSMnx.
- Interactive maps with mplleaflet and folium (see the sketch after this list).
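Here's a small sketch of the vector-data workflow (my own example; `bengaluru_buildings.geojson` is a hypothetical file standing in for any OSM export, e.g. from the HOT Export Tool or the Overpass API):

```python
# Loading OSM-derived vector data and rendering it on an interactive map.
import folium
import geopandas as gpd

gdf = gpd.read_file("bengaluru_buildings.geojson")  # also reads Shapefile, KML, GeoPackage
print(gdf.head())  # each row: one vector feature plus its attributes

m = folium.Map(location=[12.97, 77.59], zoom_start=13)  # centered on Bangalore
folium.GeoJson(gdf).add_to(m)  # overlay the vector features
m.save("buildings_map.html")   # open in a browser to pan and zoom
```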
After the talk, Ramya fielded various questions from the audience. Once those were over, it was time for the lightning talks. The concept of a lightning talk is that anyone can come up on stage and talk about anything: something they recently learned, a project they are working on, or something from their day job. These are informal talks and you are not required to have slides or anything; just come on stage and speak. We had 3 such lightning talks.
- The first was by Shashi, the venue host. He is a data science manager at AbInBev Bangalore. He spoke about what his team does, what kinds of problems they solve, and what they will be doing next. He also mentioned a few openings in the team.
- The second was a talk about Code Quality in Python. I don't remember the speaker's name (I am sorry, buddy, if you are reading this). He talked about context managers and debuggers. He was inspired by Akash's talk and decided to give this lightning talk.
- The third was about Azure logging in your code and how to work with the dashboard Azure provides. This speaker (sorry friend, can't remember your name either) works at AbInBev Bangalore, and they use Azure logging and the dashboard in their projects.
This was the end of the lightning talks. Two attendees also announced job openings at their companies. After this, everyone dispersed for pizzas provided by the incredible folks at AbInBev. This is also a networking session where you get to meet and talk with the other attendees and the speakers.
Thus we had another successful session of PyData Bangalore. If you would like to speak at a future PyData meetup, post your proposal as an issue at this link. Yeah, we use open source even for this :p. Once you press New Issue, you'll be given a format; just fill it in accordingly and one of the PyData team members will take it forward. As easy as that. And here are the attendees and the speakers at the AbInBev office.