Datafloq


AI and Machine Learning in Bioinformatics

Terry Wilson / 12 min read.
February 13, 2023

Advancements in DNA sequencing techniques have enabled researchers to sequence the human genome in just a day, a task that took around a decade with traditional approaches. This is only one of many powerful contributions of machine learning in bioinformatics.

As many biotech companies hire ML consultants to facilitate the process of handling biomedical data, the AI in bioinformatics market continues to grow. It is predicted to reach $37,027.96 million by 2029, growing at a CAGR of 42.7% from 2022. Do you want to be a part of this digital revolution?

This article gives a brief introduction to ML, explains how it supports biomedical research, and enumerates the challenges you might face deploying this technology.

Introduction to machine learning for bioinformatics

Machine learning is a subset of the broader field of artificial intelligence (AI). It enables systems to independently learn from data and execute tasks that they are not explicitly programmed to handle. Its goal is to give machines the ability to perform tasks that require human intelligence, such as diagnosing, planning, and predicting.

There are two main types of machine learning:

  1. Supervised learning relies on labeled datasets to teach algorithms an existing classification system and how to make predictions based on it. This ML type is used to train decision trees and neural networks.
  2. Unsupervised learning doesn’t use labels. Instead, algorithms try to uncover data patterns on their own. In other words, they learn things that we can’t teach them directly. This is comparable to how the human brain works.

It’s also possible to combine labeled and unlabeled data during training, which will result in semi-supervised learning. This ML type can be useful when you don’t have enough high-quality labeled data for a supervised learning approach, but you still want to use it to direct the learning process.

What are the most popular machine learning techniques used in bioinformatics?

Some of these algorithms fall strictly under the supervised/unsupervised learning categories, and some can be used with both methods.

Natural language processing

Natural language processing (NLP) is a set of techniques that can understand unstructured human language.

NLP can search through volumes of biology research, aggregate information on a given topic from various sources, and translate research findings from one language to another. In addition to mining research papers, NLP solutions can parse relevant biomedical databases.

NLP can benefit the bioinformatics field in the following ways:

  • Interpreting genetic variants
  • Analyzing DNA expression arrays
  • Annotating protein functions
  • Looking for new drug targets
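As a rough illustration of literature mining, the sketch below counts how often a gene symbol and a disease term appear in the same abstract. Everything in it is an assumption made for illustration: the function name co_mentions, the placeholder symbols GENE1 and GENE2, and the made-up abstracts. Real NLP pipelines use far more sophisticated entity recognition than this dictionary lookup.

```python
import re
from collections import Counter
from itertools import product


def co_mentions(abstracts, genes, diseases):
    """Count abstracts in which a gene term and a disease term both appear."""
    counts = Counter()
    for text in abstracts:
        # Whole-word, case-insensitive dictionary matching.
        found_genes = [g for g in genes
                       if re.search(rf"\b{re.escape(g)}\b", text, re.IGNORECASE)]
        found_diseases = [d for d in diseases
                          if re.search(rf"\b{re.escape(d)}\b", text, re.IGNORECASE)]
        counts.update(product(found_genes, found_diseases))
    return counts


# Made-up abstracts with placeholder names, for illustration only.
abstracts = [
    "GENE1 expression was altered in disease X samples.",
    "No link between GENE2 and disease X was observed; GENE1 was not measured.",
    "GENE2 variants were studied in disease Y.",
]
counts = co_mentions(abstracts, ["GENE1", "GENE2"], ["disease X", "disease Y"])
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Note that the second abstract is counted as a co-mention even though it reports no link. Simple co-occurrence cannot tell a positive finding from a negative one, which is one reason production systems rely on trained language models rather than keyword matching.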

Neural networks

This is a multi-layered structure consisting of nodes/neurons as its building blocks. Neurons in adjacent layers are connected to each other via links, but neurons of the same layer are not interlinked. The input layer neurons receive information, process it, and pass it along as an input to the next layer. And this process continues until the processed information reaches the output layer.

The most basic neural network is called a perceptron. It consists of one neuron that acts as a classifier. This neuron receives an input and places it in one of two classes using a linear discrimination function. In larger neural networks, there is no limit on the number of layers or the number of nodes in one layer.
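The perceptron described above can be sketched in a few lines of Python. This is a minimal illustration using the classic perceptron learning rule on a made-up toy dataset (the logical AND function). The function names train_perceptron and predict are hypothetical and do not come from any library.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train a single-neuron perceptron. Labels must be 0 or 1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = y - predicted  # 0 when the prediction is correct
            # Move the linear decision boundary only on mistakes.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Place the input in one of two classes using a linear function."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy data: the logical AND of two binary inputs (linearly separable).
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```

A single perceptron can only learn classes that a straight line (or hyperplane) can separate, which is why larger networks stack many neurons into layers.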

Neural networks are used in bioinformatics for:

  • Classifying gene expression profiles
  • Predicting protein structure
  • Sequencing DNA

Clustering

Unsupervised clustering is the process of organizing elements into various groups based on the supplied definition of similarity. As a result of such classification, the elements positioned in one cluster closely relate to one another, and differ from elements in other clusters.

Unlike supervised classification, clustering does not rely on predefined class labels, and we often don’t know in advance how many clusters will be formed. One famous example of this machine learning approach in bioinformatics is microarray-based expression profiling of genes, where genes with similar expression levels are positioned in one cluster.
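To make the idea concrete, here is a minimal sketch of k-means, one common clustering algorithm, applied to made-up expression profiles of six hypothetical genes measured under three conditions. Squared Euclidean distance serves as the definition of similarity. Note that this particular algorithm needs the number of clusters k as an input, and the naive initialization used here (taking the first k points as starting centroids) is an assumption chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Made-up expression levels for hypothetical genes under three conditions.
profiles = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (8.0, 7.5, 8.2),
    "geneC": (1.1, 0.8, 1.0),
    "geneD": (7.9, 8.1, 7.7),
    "geneE": (0.9, 1.0, 1.1),
    "geneF": (8.2, 7.9, 8.0),
}
assignments, centroids = kmeans(list(profiles.values()), k=2)
for name, cluster in zip(profiles, assignments):
    print(name, "-> cluster", cluster)
```

Genes with similar profiles end up in the same cluster without any labels being supplied. In practice, methods such as hierarchical clustering are often used when the number of clusters is not known up front.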

Dimensionality reduction

In machine learning classification problems, classifications are performed based on factors/features. Sometimes there are too many factors that affect the final result, making the dataset difficult to visualize and manipulate. Dimensionality reduction algorithms can minimize the number of features, making the dataset more manageable. For instance, a climate classification problem might have humidity and rainfall among its features. These two can be collapsed into one factor for the sake of simplicity as they are both closely related.

Dimensionality reduction has two main components:

  • Feature selection. Chooses a subset of variables to represent the entire model by embedding, filtering, or wrapping features.
  • Feature extraction. Derives a smaller set of new features from the original ones. For instance, points in a 3D space can be projected onto a 2D plane.

This type of algorithm is used to compress large datasets for the sake of reducing computational time and storage requirements. It can also eliminate redundant features present in the data.
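The humidity and rainfall example can be sketched with principal component analysis (PCA), a widely used feature extraction technique. The code below is a simplified illustration under stated assumptions: the measurements are made up, the function name first_principal_component is hypothetical, and the leading component is found with plain power iteration rather than a numerical library.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the leading principal axis and each row's score along it."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Project every centered row onto the axis: two features become one.
    scores = [sum(c[d] * vec[d] for d in range(dims)) for c in centered]
    return vec, scores


# Made-up (humidity %, rainfall mm) pairs that rise together.
measurements = [(30, 10), (45, 22), (60, 35), (75, 48), (90, 60)]
component, scores = first_principal_component(measurements)
print("axis:", component)
print("one-dimensional scores:", scores)
```

Each location is now described by a single score instead of two correlated numbers. Because the features here are on different scales, real analyses usually standardize them first.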

Decision tree classifiers

This is one of the most popular classical supervised learning classifiers. These algorithms apply a recursive approach to build a flowchart-like tree model, where each node represents a test on a feature. First, the algorithm determines the top node – the root – and then builds the tree recursively considering one parameter at a time. The final node in each sequence is called “the leaf node.” It represents the final classification and holds the class label.

Decision tree models demand high computational power during training, but afterwards they can perform classifications without extensive computing. The main advantage these classifiers bring to the bioinformatics field is that they generate understandable rules and explainable results.
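The core step of building such a tree, choosing which feature test to place in a node, can be sketched as follows. This is a minimal illustration that scores candidate splits with Gini impurity, one commonly used criterion. The source does not say which criterion is meant, so that choice is an assumption, as are the function names and the made-up samples.

```python
def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Find the (impurity, feature index, threshold) of the purest split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a test that separates nothing is useless
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


# Made-up samples: two hypothetical gene expression levels per sample.
rows = [(2.1, 0.3), (1.8, 0.9), (2.5, 0.4), (7.9, 0.5), (8.4, 0.8), (9.0, 0.2)]
labels = ["healthy", "healthy", "healthy", "tumor", "tumor", "tumor"]
best = best_split(rows, labels)
print(best)
```

A full tree applies the same search recursively to the left and right groups until the leaves are pure or a stopping rule is hit. The resulting rule, such as "feature 0 is at most 2.5", is directly readable, which is the explainability advantage mentioned above.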

Support vector machine

This is a supervised ML model that can solve two-group classification problems. To classify data points, these algorithms look for an optimal hyperplane that separates the data into two classes with the maximum margin, that is, the largest possible distance between the hyperplane and the nearest data points of each class.


The points located on either side of the hyperplane belong to different classes. The hyperplane’s dimension depends on the number of features. In the case of two features, the decision boundary is a line; with three features, it’s a 2D plane. With more than three features, the hyperplane becomes hard to visualize, although the algorithm itself still works in higher-dimensional spaces.
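For two features, a linear SVM can be sketched as a search for the line w·x + b = 0 that separates two classes. The code below is a simplified illustration, not a production solver: it minimizes the regularized hinge loss with plain subgradient descent, and the function names, hyperparameters, and data points are all assumptions made for the example.

```python
def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01):
    """Fit weights and bias on the regularized hinge loss. Labels: -1 or +1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1:
                # Inside the margin or misclassified: move the hyperplane.
                weights = [w + lr * (y * xi - reg * w)
                           for w, xi in zip(weights, x)]
                bias += lr * y
            else:
                # Safely classified: only apply the regularization shrinkage.
                weights = [w - lr * reg * w for w in weights]
    return weights, bias


def classify(weights, bias, x):
    """Report which side of the hyperplane the point falls on."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else -1


# Made-up, standardized expression levels of two hypothetical genes.
samples = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, 1, 1, 1]
weights, bias = train_linear_svm(samples, labels)
predictions = [classify(weights, bias, x) for x in samples]
print(predictions)
```

The same code runs unchanged with any number of features, since the hyperplane is just a weight per feature plus a bias.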

This approach is useful in computational identification of functional RNA genes. It can select the optimal set of genes for cancer detection based on their expression data.

Top 5 applications of machine learning in bioinformatics

After giving a brief introduction to machine learning and highlighting the most commonly used ML algorithms, let’s see how they can be deployed in the bioinformatics field.

If any of these use cases catches your attention, turn to AI software consulting professionals to implement a customized solution for your business.

1. Facilitating gene editing experiments

Gene editing refers to manipulations on an organism’s genetic composition by deleting, inserting, and replacing a part of its DNA sequence. This process typically relies on the CRISPR technique, which is rather effective. But there is still much improvement to be desired in the area of selecting the right DNA sequence for manipulation, and this is where ML can help. Using machine learning for bioinformatics, researchers can enhance the design of gene editing experiments and predict their outcomes.

A research team employed ML algorithms to discover the optimal combinations of amino-acid residue variants that allow the genome-editing protein Cas9 to bind with the target DNA. Due to the large number of these variants, screening all of them experimentally would have been impractical, but an ML-driven engineering approach reduced the screening burden by around 95%.

2. Identifying protein structure

Proteomics is the study of proteins, their interactions, composition, and their role in the human body. This field involves heavy biological datasets and is computationally expensive. Therefore, technologies like machine learning in bioinformatics are essential here.

One of the most successful applications in this field is using convolutional neural networks to position proteins’ amino acids into three classes – sheet, helix, and coil. Neural networks can achieve an accuracy of 84% with the theoretical limit being 88%-90%.
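Accuracy figures for this three-class task are commonly reported as the share of residues whose predicted class matches the observed one, a measure often called Q3. The source does not state how the 84% was measured, so the sketch below is only an assumption about the metric, and the two sequences in it are made up.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted class (H, E or C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# H = helix, E = sheet (strand), C = coil; one letter per amino acid.
observed = "CCHHHHEEEC"
predicted = "CCHHHCEEEC"
accuracy = q3_accuracy(predicted, observed)
print(accuracy)
```

Here nine of the ten residues match, so the score is 0.9.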

Another usage of ML in proteomics is protein model scoring, a task essential to predict protein structure. In their machine learning approach to bioinformatics, researchers from the Fayetteville State University deployed ML to improve protein model scoring. They divided protein models under question into groups and used an ML interpreter to decide on the feature vector to evaluate models belonging to each group. These feature vectors were used later to further improve the ML algorithms while training them on each group separately.

3. Spotting genes associated with diseases

Researchers increasingly use machine learning in bioinformatics to identify genes that are likely to be involved in particular diseases. This is achieved by analyzing gene expression microarrays and RNA sequencing.

In particular, gene identification is gaining traction in cancer-related studies, where it is used to identify genes that are likely to contribute to cancer, as well as to classify tumors by analyzing them on a molecular level.

For instance, a group of scientists at the University of Washington tested several machine learning algorithms, including decision trees, support vector machines, and neural networks, for their ability to predict and classify cancer types. The researchers used RNA sequencing data from The Cancer Genome Atlas project and discovered that the linear support vector machine was the most precise, reaching 95.8% accuracy in cancer classification.

In another example, researchers used ML to classify breast cancer types based on gene expression data. This team also relied on the Cancer Genome Atlas project’s data. The researchers classified the samples into triple negative breast cancer – one of the most lethal breast cancers – and non-triple negative. And once again, the support vector machine classifier delivered the best results.

Speaking of non-cancerous diseases, researchers at the University of Pennsylvania relied on machine learning to identify genes that would be a suitable target for coronary artery disease (CAD) drugs. The team used the ML-powered Tree-based Pipeline Optimization Tool (TPOT) to pinpoint a combination of single nucleotide polymorphisms (SNPs) related to CAD. They analyzed the genomic data from the UK Biobank and uncovered 28 relevant SNPs. The relation between the SNPs on top of this list and CAD was previously mentioned in the literature, and this research gave a practical validation.

4. Traversing the knowledge base in search of meaningful patterns

Advanced sequencing technology doubles the size of genomic databases every 2.5 years, and researchers are looking for a way to extract useful insights from this accumulated knowledge. Machine learning in bioinformatics can sift through biomedical publications and reports to identify different genes and proteins and search for their functionality. It can also aid in annotating protein databases and complement them with the information it retrieves from the literature.

One example comes from a group of researchers who deployed bioinformatics and machine learning in literature mining to facilitate protein model scoring. Structural modeling of protein-protein dockings typically results in several models that are further scored based on structural constraints. The team used ML algorithms to traverse PubMed papers on protein-protein interactions, searching for residues that could help generate these constraints for model scoring. And to make sure that the constraints are relevant, scientists explored the ability of different machine learning algorithms to check all discovered residues for relevancy.

This research revealed that computationally expensive neural networks and less resource-demanding support vector machines achieved very similar results.

5. Repurposing drugs

Drug repurposing, or reprofiling, is a technique scientists use to discover new applications of existing drugs that they were not intended for. Researchers adopt AI in bioinformatics to perform drug analysis on relevant databases, such as BindingDB and DrugBank. There are three major directions for drug repurposing:

  • Drug-target interaction looks into the drug’s ability to bind directly to the target protein
  • Drug-drug interaction investigates how medications act when they are taken in combinations
  • Protein-protein interaction looks into the surface of interacting intracellular proteins, and attempts to discover hotspots and allosteric sites.

Researchers from the China University of Petroleum and Shandong University developed a deep neural network algorithm and used it on the DrugBank database. They wanted to study drug-target interactions between drug molecules and the mitochondrial fusion protein 2 (MFN2), which is one of the main proteins that can possibly cause Alzheimer’s disease. The study identified 15 drug molecules with binding potential. Upon further investigation, it appeared that 11 of them can successfully dock with MFN2, and five of them have medium to strong binding force.

Challenges presented by machine learning in bioinformatics

Machine learning in bioinformatics differs from ML in other sectors due to the four factors below, which also constitute the main challenges of applying ML to this field.

  1. Bioinformatics AI is expensive. For the algorithm to perform properly, you need to acquire a large training dataset. However, it’s rather costly to obtain 10,000 chest scans, or any other type of medical data for that matter.
  2. Difficulties associated with the training datasets. In other fields, if you don’t have enough training data, you can generate synthetic data to expand your dataset. However, this trick might not be appropriate when it comes to human organs. The problem is that your scan generation software might produce a scan of a real human. And if you start using that without the person’s permission, you will be in gross violation of their privacy. Another challenge associated with the training data is that if you want to build an algorithm that works with rare diseases, there will not be much data to work with in the first place.
  3. The confidence level must be very high. When human life depends on the algorithm’s performance, there is just too much at stake, which does not leave room for error.
  4. Explainability issue. Doctors will not be open to using the ML model if they don’t understand how it produced its recommendations. You can use explainable AI instead, but these algorithms are not as powerful as some black-box unsupervised learning models.

For general AI-associated challenges and implementation tips, check out our article and a free eBook.

To sum up

AI and ML technologies have many applications in the medicine and biology fields. On our blog, you can find more information on artificial intelligence in clinical trials, AI in cancer diagnosing and treatment, and benefits of AI in healthcare.

Bioinformatics is another medicine-related field where ML and AI-based medical solutions come in handy. Bioinformatics requires handling large amounts of varied data, such as genome sequences, protein structures, and scientific publications. ML is well known for its data processing capabilities. However, many AI bioinformatics models are expensive to run. It can take hundreds of thousands of dollars to train a deep learning algorithm. For instance, training the AlphaFold2 model for protein structure prediction consumes the equivalent of 100-200 GPUs running for several weeks.

You can find more information on what to expect price-wise in our article on how much it costs to implement AI.

If you want to deploy machine learning in bioinformatics, drop us a line. We will work together with you to find the best-suited ML models for a reasonable budget.

Considering deploying machine learning in bioinformatics, but not sure which model is right for you? Get in touch! We will assist you in selecting the best-suited ML type for the task. We’ll also help you build or customize, train, and deploy the algorithm.


About Terry Wilson

Terry is an Innovation Analyst at ITRex, an emerging technology development and consulting company.
