Thinking about your audience
Published:
Think About Your Audience
This is the third in a series of blog posts related to the Outreachy internship, and as a participant in the Ersilia project, I will tell you about the tasks that I’m working as internt during this internship.
We are currently aware of the technological development and transformation in the world presented by machine learning (ML) and artificial intelligence (IA), and thanks to these tools Ersilia can carry out its objectives in the incorporation of AI and ML models for drug discovery. This entire Ersilia tech ecosystem presents a somewhat steep learning curve for newcomers like me, but every time I learn something new I feel like that curve is flattening and I’m moving a bit away from the newcomer perspective.
At the beginning, any project can be overwhelming, and in the case of Ersilia, for many applicants there may be new concepts, many of them related to chemistry and machine learning tools, but curiosity, interest and continuity in learning throughout the internship will make the process rewarding in terms of task completion. From this point of view, and thinking about the title of this publication, “Think about your audience”, it is important for me to talk to you about the process to incorporate models in Ersilia.
Ersilia has an open source repository, called Ersilia Model Hub, it is public and contains artificial intelligence and machine learning models, some of the new models we are working on are models from the scientific literature that are developed by others. Although Ersilia has many categories of models in this repository, I will only tell you about one category, the models from the literature that are pre-trained and in which I have worked so far.
In the first stage, mentors from their field of medical science provide us with the scientific literature, which is inclined towards the application of machine learning in drug design. Interpreting the literature results in incorporating various models of predictive, generative, and translation artificial intelligence; Depending on the type of tools that are used, knowing their inputs and knowing the type of output, understanding the training data, the ML algorithm used, are important elements to take into account for the subsequent steps in the incorporation of the model.
In the second stage, testing the original model in our system from the command console is the next step. Taking into account that Ersilia runs on an Ubuntu operating system, in a conda environment with a version of python installed equal to 3.7. We proceed to install the dependencies within the aforementioned requirements, we execute the original model following the installation steps and obtain the expected results. In some cases there are installation problems, many of them related to dependency conflicts between different packages, some like in my case, my conda environment was failing generating conflict errors between them, and what led me to have to re-install the conda environment, as an easy and fast solution to advance in the development of the incorporation of the model. Identifying these errors is not an easy task for a newcomer and can become frustrating, luckily the support of mentors is essential to solve these problems. But if the installation runs correctly we can continue with the next step.
At this point in the process of incorporating models into Ersilia, I would like to mention that I found it valuable to write in a .txt (or text file), with the dependencies that I installed to run the model correctly. Even the ones I installed after solving module import problems that occurred when running some models. In order to take them into account for the Docker container, since Ersilia uses a Docker file to specify the installation instructions.
Another thing that I find useful is to create an input/output adapter, which is a script that allows you to read the input molecules from a .csv (or a plain text file) and write the predictions to a .csv file and test. the model with this adapter. This made it easier and faster for me to implement the model in Ersilia.
For the following stages of incorporating the model, Ersilia has a template that is automatically generated when our mentor approves it, it means that through Github repository, we open a Pull Request of the model, we provide the description of the model and the resulting tests of the execution. Once reviewed by the community coordinators, they approve the model, from Github, and this automatically creates a repository with the id of the model. This repository is forked in the personal repository, then cloned, to make the modifications in my local and then merge them with the model repository in Ersilia. In this process, it is essential to read the guide provided by Ersilia to understand the folders and files contained in this template. What is very important is that we must know exactly where to copy the code and the pre-trained models, which code goes in which folder. The main.py file is the main file, which is the only one that we must modify, it depends on each model what other modifications must be made. Ersilia is also working on including the model input in the backend (server or models) automatically, so we don’t have to wait for the platform administrators to add it manually. Finally, we can test the model on our local system through the –repo_path and upload all the code in the repository and then merge it with Ersilia’s, so that it can finally be used by other people such as data scientists, experimental researchers, biomedics, and others general public.