Arindam Paul is a graduating PhD student at Northwestern University specializing in the optimization of machine learning for predicting and discovering new materials. A chemical engineer and computer scientist turned machine learning researcher, Arindam’s research has been published in a variety of international journals and at several conferences including SIGCHI and NeurIPS. In his free time, Arindam is extremely active on Quora, where he has amassed over 650,000 answer views across a range of machine learning topics.
Our conversation was a reflection of Arindam’s diverse interests, encompassing his thoughts about machine learning in social media, data mining techniques, and where the hype around machine learning should really be directed. If you want to read more on the AI issues that the experts are excited about, feel free to check out the rest of our interview series here.
Gengo: How did you come to be involved in machine learning?
Arindam: Although machine learning hasn’t always been my specialist subject, I’ve been very curious about it since childhood. During high school I programmed extensively in BASIC and C++, which led me to do a minor in computer science as an undergrad. This became my primary field of study as a postgrad, where I completed an MEng in Software Systems before starting my PhD in Computer Science at Northwestern.
During my first two years here, I was lucky enough to be able to rotate across different labs and explore a range of research areas, from Internet privacy and anonymous social media to P2P and distributed systems. I particularly love that my PhD thesis allows me to harness my knowledge of materials science and chemistry to solve problems in materials informatics and cheminformatics.
G: Which problems does your research focus on?
A: These are some of the problems I tried to solve in my PhD:
- Using machine learning to develop search-based methods that optimize microstructures for use in high-speed parts. These are used in industries that have many design constraints, such as aerospace.
- Developing machine-learning-based simulations for additive manufacturing (AM). AM is not just used for building prototypes, as in 3D printing, but also for building specialized parts with expensive materials such as titanium. This means you can’t use trial and error to get the final part right. Materials scientists have traditionally relied on computational simulations based on heat-energy calculations, but these simulations can take a long time to run, even days. Using two classes of algorithms, recurrent neural networks and random forests, we developed a machine-learning-based alternative to these computational simulations.
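To illustrate the surrogate-modeling idea behind this kind of work, here is a minimal sketch (not the actual Northwestern pipeline): an expensive simulation is sampled a few hundred times, and a random forest is trained to approximate it. The `slow_simulation` function is an invented stand-in for a physics-based heat calculation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def slow_simulation(params):
    # Invented stand-in for an expensive physics-based simulation
    # (e.g. a heat-transfer calculation that takes hours per run).
    x, y = params
    return np.sin(3 * x) + 0.5 * y ** 2

# Run the expensive simulation on a modest sample of design points.
X_train = rng.uniform(-1, 1, size=(500, 2))
y_train = np.array([slow_simulation(p) for p in X_train])

# Fit a cheap surrogate that can be queried thousands of times per second.
surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(X_train, y_train)

# Check the surrogate against the true simulation on held-out points.
X_test = rng.uniform(-1, 1, size=(100, 2))
y_true = np.array([slow_simulation(p) for p in X_test])
y_pred = surrogate.predict(X_test)
print("mean absolute error:", np.mean(np.abs(y_pred - y_true)))
```

Once trained, the surrogate can stand in for the simulation inside an optimization loop, turning a days-long search into a fast one.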
- Developing methods for predicting power conversion efficiency of solar cells using different representations such as chemical formulas and molecular fingerprints as inputs.
Aside from this, my internship at Boeing Cyber-Security Labs (Narus Inc.) uncovered some more general issues that I find fascinating. While at Boeing, I examined how advertisers were targeting different demographics and helped build supervised machine learning algorithms that could predict a person’s age range, gender or music interests based on the ads their profile received. The hardest challenges were deciding whether a machine-learning-based approach would be helpful at all, and how to design the input and output variables to frame the task as a machine learning problem. I realized that there is a huge difference between taking a machine learning course and applying it to a real project.
The ML community is obsessed with new algorithms, but we often forget that modeling a problem is one of the biggest challenges in creating machine learning solutions.
G: You’ve previously published research on anonymous social media and the ways that people communicate in these spaces, which has some interesting implications for fields like natural language processing. How do you see this research applying to machine learning?
A: I was working on a couple of research questions in anonymous social media. These concerned whether people ask questions about taboo topics on anonymous boards, and how many of the responses were pro-social as opposed to bullying. After that research was published, I attempted to predict taboo topics using machine learning and natural language processing.
A lot of social media is pseudonymous. If you look at Reddit or Twitter, many accounts are not tightly coupled to an individual’s real name. One of the limitations of my work was the size of the dataset. However, social media companies, which have much bigger datasets, are already working on detecting profanity and hate speech using natural language processing. After the widespread failure of election predictions in 2016, pollsters are also increasingly looking towards social media as a better indicator of how people are aligning politically.
G: In your experience, what are the best ways to mine data for projects like this?
A: For my project, I created a library that uses the Facebook Graph API to download posts and comments from Facebook Confessions. However, since the Cambridge Analytica Senate hearings, Facebook has put more restrictions on these libraries.
The best method really depends on the expertise of the programmer. Five years ago, I spent several months building a system that could automatically collect data using web automation and scraping, so I’m confident about collecting data from any website which shares data publicly. However, most data scientists prefer to collect data using APIs, as the data cleaning stage is significantly reduced. Also, websites like Kaggle do a good job of sharing datasets in a sufficiently structured manner.
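As a minimal sketch of what scraping involves, the snippet below uses Python’s standard-library `html.parser` to pull specific elements out of a page. The HTML is inlined to keep the example self-contained; in a real scraper it would come from an HTTP response, and the `quote` class name is invented for illustration.

```python
from html.parser import HTMLParser

class QuoteScraper(HTMLParser):
    """Collect the text of every <p class="quote"> element."""
    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "quote") in attrs:
            self.in_quote = True

    def handle_data(self, data):
        if self.in_quote:
            self.quotes.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_quote = False

# Inline snippet standing in for a downloaded page.
html = """
<div><p class="quote">First quote</p>
<p class="other">Noise</p>
<p class="quote">Second quote</p></div>
"""

scraper = QuoteScraper()
scraper.feed(html)
print(scraper.quotes)  # ['First quote', 'Second quote']
```

This is exactly the kind of brittle, structure-dependent extraction that an API avoids, which is why the cleaning effort drops so much when an API is available.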
G: You’re also working on a variety of other machine learning applications, from manufacturing simulations to measuring the power conversion efficiency of solar cells. Are there any exciting or unexpected uses of machine learning in industry that you wish more people were aware of?
A: I think one of the biggest developments due to the rise of Artificial Intelligence (AI) is interdisciplinary research. I am currently collaborating with different labs in materials science and mechanical engineering at Northwestern, as well as at other universities.
Oftentimes, the hype in ML surrounds image recognition of animals or humans, along with the prediction and generation of social media text. However, the most useful applications of ML and AI are happening in domains such as bioinformatics, health informatics, cheminformatics, and materials informatics, among many others.
I saw a good example of this at NeurIPS 2018, where Google DeepMind shared their research on AI for scientific discovery. They unveiled an exciting system called AlphaFold, which builds on years of prior research in using vast genomic data to predict protein structure.
In my own research, I have used machine learning to discover new microstructure candidates for use in airplane materials. Recent advances in machine learning, combined with large computational datasets, have also fueled a boom in building robots based on reinforcement learning.
G: What are some developments to watch out for in your field over the coming year?
A: I would divide the developments into two parts: algorithmic and operational. On the algorithmic side, I think there will be advancements in reinforcement learning and unsupervised learning in the coming years. Despite the recent success of neural networks, I also think there will be imminent improvements to their interpretability. This is currently a significant barrier to the increasing use of neural networks in our daily lives, so I would expect a decent amount of research to be devoted to this important issue in 2019.
On the operational side, AI and ML are still not used as widely as they could be, despite all the hype. A lot of management teams in industry are sceptical of machine learning – in some cases, rightfully so! I think we will see a lot of changes to that attitude in the coming years. We have the capabilities to automate a lot of things, but we are still far from actually doing it. As perspectives on machine learning change, we will see a growth of infrastructure in many industries. We will also see more undergrad-level courses in ML and AI at universities around the world. Another area of growth is the generation of high-quality datasets, as well as the publication and integration of already existing ones.
G: With a diverse portfolio such as yours, you must have worked with a wide range of data. What are some things that you look for in a dataset and how do you ensure that you have high-quality data?
A: I’ve been fortunate to work on a variety of image, text and structured datasets, sometimes with millions of data points. In my experience, one thing that beginners often don’t realize is that a high-quality dataset is one with a good distribution. For example, if we have to build a classifier that can differentiate a dog from a cat, we need a wide range of images of each animal in that dataset. Ideally, this should include real-life edge cases: cats that look like dogs and dogs that look like cats.
An algorithm is only as good as the data on which it is modeled.
Also, data scientists spend roughly 50-80% of their time simply cleaning data. I think that we can expect a growth of services and libraries which do automatic data cleaning and pre-processing in the near future.
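To make the cleaning step concrete, here is a small sketch of routine operations on raw tabular records: normalizing whitespace and case, flagging missing fields, and dropping exact duplicates. The field names and values are invented for illustration.

```python
# Raw rows as they might arrive from a scrape or export.
raw_rows = [
    {"name": "  Alice ", "age": "34", "city": "Chicago"},
    {"name": "BOB", "age": "", "city": "evanston"},
    {"name": "  Alice ", "age": "34", "city": "Chicago"},  # exact duplicate
]

def clean(row):
    return {
        "name": row["name"].strip().title(),             # fix whitespace/case
        "age": int(row["age"]) if row["age"] else None,  # empty string -> None
        "city": row["city"].strip().title(),
    }

seen, cleaned = set(), []
for row in raw_rows:
    c = clean(row)
    key = tuple(c.items())
    if key not in seen:  # drop duplicates after normalization
        seen.add(key)
        cleaned.append(c)

print(cleaned)
# [{'name': 'Alice', 'age': 34, 'city': 'Chicago'},
#  {'name': 'Bob', 'age': None, 'city': 'Evanston'}]
```

Real pipelines do this at scale with libraries like pandas, but the underlying decisions (how to normalize, what counts as missing, what counts as a duplicate) are the same.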
G: Where do you normally source your data from?
A: I usually get my data from a mixture of the following:
- Websites like GitHub and Kaggle which share datasets
- Repositories shared by researchers on their webpages
- Using APIs to collect data
- Scraping data from websites
I also had the opportunity to work with a human-in-the-loop system on the aforementioned anonymous social media project. With a team of undergraduate students, we labeled a dataset of around 3,000 Facebook posts and comments, and used these labels to build a machine learning system that classified posts as taboo or non-taboo. The human-in-the-loop setup not only helped us achieve decent accuracy, but also allowed us to investigate the misclassified posts and, ultimately, build a better classifier.
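A minimal sketch of this kind of labeled-text classifier, using a bag-of-words Naive Bayes model in scikit-learn. The four posts and their labels are invented toy examples, not the actual study data, which was roughly 3,000 human-labeled posts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy posts standing in for the human-labeled dataset.
posts = [
    "I have been struggling with addiction",
    "anyone else dealing with depression alone",
    "what is the best dining hall on campus",
    "looking for a study group for calculus",
]
labels = ["taboo", "taboo", "non-taboo", "non-taboo"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(posts, labels)

print(model.predict(["struggling with depression"]))   # ['taboo']
print(model.predict(["best study spot on campus"]))    # ['non-taboo']
```

The misclassified posts from a model like this are exactly what the human labelers can re-examine, closing the loop that improves both the labels and the classifier.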
G: Finally, do you have any advice for anyone looking to get involved in building machine learning algorithms?
A: I often see practitioners trying out neural networks before they truly understand the distribution of their data or the impact of their dataset’s size. I have worked with datasets as small as 350 data points and as large as several million. Unless the problem is very clearly best solved with deep learning, as in image recognition, I think one should start by experimenting with simpler algorithms. Simpler algorithms usually overfit less and are, more importantly, easier to explain. The European Union has already taken initiatives toward requiring that one be able to explain the algorithms one uses. My recommendations are as follows:
- Time Series: Try ARMA and ARIMA before recurrent neural networks.
- Text Prediction: Try Naive Bayes and random forests before LSTMs/1-D CNNs with word2vec/GloVe/fastText embeddings.
- Structured Data (Vector Input): Try regularized linear regression (ridge) and random forests before fully connected neural networks.
- Image Recognition: Here neural networks win. However, try conventional convolutional networks before residual networks and capsule networks.
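For the structured-data case, the "baselines first" workflow can be sketched in a few lines of scikit-learn: cross-validate a regularized linear model and a random forest before reaching for a neural network. The dataset here is synthetic; in practice it would be your own feature matrix.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic structured (vector-input) regression data.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Cross-validate two simple, explainable baselines.
for name, model in [
    ("ridge", Ridge(alpha=1.0)),
    ("random forest", RandomForestRegressor(n_estimators=100, random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

If a ridge regression already scores well, a fully connected network has to beat it by a meaningful margin to justify the loss of interpretability.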
Apart from this, my internship at an insurance company made me realize the importance of interpretability in business applications. Companies want to see why a prediction is good or bad; accuracy and precision are important but not sufficient. They want to know why a model mislabels certain data points, and exactly how its behavior changes as we make improvements. We don’t want a model so sensitive to its training data that it fails in real life. In industry, at the very least, data scientists go through their model’s predictions on real-life data and examine where it is failing. Without growth in interpretability, though, machine learning will hit a roadblock in many sectors.