LLMs and Antibodies

6 minute read

Published: March 04, 2023


motivation

I wanted to understand whether LLMs / PLMs (protein language models) can be useful for analyzing antibody sequence data. Antibodies are proteins made by the immune system which are critical for protection against viruses. Here I used antibody sequencing data, AbLang (an antibody LLM), and single-cell RNA sequencing data to investigate.

protein language models

Protein language models are inspired by natural language processing (NLP) models like GPT-4. They are designed to analyze and predict protein sequences, which are chains of amino acids with distinct functions in living organisms. These models are trained on large datasets of known protein sequences and appear to learn general truths and patterns about protein space through relatively simple and unsupervised training procedures. By understanding these patterns, the models can predict protein structure and function. Similar to NLP models, protein language models use a technique called “attention” to weigh the importance of different amino acids in the sequence, allowing them to capture long-range interactions and dependencies between amino acids. I wanted to understand how these models can be applied to studying antibody evolution in particular.
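To make the "attention" idea above a bit more concrete, here is a minimal sketch of scaled dot-product attention weights for one residue in a toy sequence. The 2-d vectors are made up for illustration and are not weights from AbLang or any real model. Real models also learn separate query and key projections, which this sketch skips by reusing the same vector for both.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical 2-d vectors for four amino acids (illustrative only).
toy_vectors = {
    "C": [1.0, 0.0],
    "A": [0.0, 1.0],
    "R": [0.9, 0.1],
    "W": [0.1, 0.9],
}
sequence = ["C", "A", "R", "W"]
keys = [toy_vectors[aa] for aa in sequence]

# How strongly does the first residue (C) attend to each position?
weights = attention_weights(toy_vectors["C"], keys)
for aa, w in zip(sequence, weights):
    print(f"C -> {aa}: {w:.3f}")
```

Positions whose vectors point in a similar direction to the query get larger weights, regardless of how far apart they sit in the sequence. That is the sense in which attention can capture long-range dependencies.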

AbLang Description


AbLang is a protein language model trained specifically on antibodies. It was readily available to me and easy to install at the time. There’s a strong argument to be made for using a more general protein language model. At the least, one should compare the results here to something like ESM from Meta.

Additional Background about Antibodies at the bottom:

All you need to know to understand what I’m going to show in this notebook is that, generally speaking, more somatic hypermutation means the antibody in question has been through more of this evolutionary process than an antibody with no somatic hypermutation.

Overview of my investigation:

associate the single cell transcriptome data with the embeddings calculated using AbLang for each B cell
perform PCA on the LLM embeddings
ad-hoc separation of similar antibody types in PCA space
differential gene-expression analysis of B cells using these LLM defined clusters
visualize embeddings with UMAP
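The PCA step in the overview above can be sketched roughly as follows. This is a minimal illustration using NumPy's SVD on synthetic data, not the code used for the figures in this post. The array `fake_embeddings`, its dimensions, and the artificial shift applied to half of the cells are all assumptions standing in for real per-cell AbLang embeddings.

```python
import numpy as np

def pca(embeddings, n_components=3):
    """Project rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# Stand-in for per-cell antibody embeddings: 200 cells x 16 dimensions.
fake_embeddings = rng.normal(size=(200, 16))
# Made-up shift in one dimension for half the cells, mimicking two groups
# (for example mutated vs. unmutated antibodies).
fake_embeddings[:100, 0] += 5.0

scores, explained = pca(fake_embeddings, n_components=3)
print(scores.shape)   # one row per cell, one column per PC
print(explained)      # variance explained by PC1..PC3
```

Each row of `scores` can then be joined back to the cell it came from, so that transcriptome-derived labels (cell type, constant region) can be plotted in PC space.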

LLMs, Antibodies, and Single Cell Genomics

A Quick Look at the Dataset


So we have hundreds to thousands of single B cells in each sample, with each B cell having an RNA transcriptome and an antibody sequence associated with it.

I’m hoping this helps orient – we can see the single cell RNA sequencing data coheres in a global way with the Antibody Sequence via these UMAPs:

I’m plotting UMAP projections of the single cell RNA sequencing data colored by metadata about the transcriptional state (far left) or the antibody sequence (right two). Memory B cells have previously participated in immune responses and thus have acquired hypermutations in the antibody sequence. Naive B cells have no mutations in their antibody sequence.
The antibody mutation level is determined by asking how far away any given antibody is in nucleotide space from our best guess of the VDJ genes which constructed it.
Memory B cells also appear to have switched Antibody Constant Region usage, another sign that they have already participated in the immune response.
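As a rough illustration of "how far away in nucleotide space", here is a small sketch that counts mismatches between an observed sequence and its inferred germline. It assumes the two sequences are already aligned to the same length. The function names and the toy sequences are hypothetical, and in practice the germline assignment and alignment come from a dedicated annotation tool rather than code like this.

```python
def mutation_fraction(observed, germline):
    """Fraction of aligned nucleotide positions that differ from germline."""
    if len(observed) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for obs, ref in zip(observed.upper(), germline.upper()):
        if obs in "-N" or ref in "-N":
            continue  # skip gaps and ambiguous bases
        compared += 1
        mismatches += obs != ref
    return mismatches / compared if compared else 0.0

def mutation_label(observed, germline):
    """Binary label in the spirit of the 'no mutation' vs. 'mutated' split."""
    return "mutated" if mutation_fraction(observed, germline) > 0 else "no mutation"

# Toy aligned sequences (made up); they differ at exactly one position.
germline = "CAGGTGCAGCTGGTGCAG"
observed = "CAGGTGCAACTGGTGCAG"

print(mutation_fraction(observed, germline))
print(mutation_label(observed, germline))
```

A label built this way depends entirely on how good the germline guess is, which is worth keeping in mind when a sequence labeled "no mutation" behaves like a mutated one.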


We can see my annotations of the antibody sequence are clearly related in a global way to the B cell type determined via single-cell RNA sequencing. For example, cells which are Memory B cells most often have antibody sequences I annotated as hypermutated.

I’m curious whether the sequence encoding of the antibody sequence can also meaningfully represent a known phenomenon like hypermutation or constant region usage. Now I’m plotting the same categorical variables from the UMAPs on the PCA of the antibody sequence embedding calculated using AbLang.


This is interesting: the language model appears to have learned what hypermutation looks like, and it pretty neatly separates antibodies with no mutations from mutated antibodies using PC1 and PC3. Furthermore, on PC3, there appears to be continuous variation in the hypermutated antibody space, but it’s clear there is an arrow that can be drawn from predominantly IGHM class antibodies to the other types (e.g. IGHA1). So the model appears to capture something meaningful about antibodies.

Let’s look at the relationship between hypermutation and the PCs in a quantitative sense. In particular I’m interested in PC3, which appears to separate IGHM antibodies from the others. Does this separation exist in a meaningful way in gene expression space as well?


As an aside, we can see a fair number of “no mutation” antibody sequences (highlighted in the red box or seen in the tail of the orange distribution) that actually appear more like mutated antibodies in the PC space. I asked if the gene expression of cells with that profile looked like Memory B cells, which would corroborate that the antibody actually is hypermutated. Indeed these cells with “no mutations” appear to have a memory-like phenotype, suggesting the language model identifies mutations in antibodies more sensitively than my original labeling of mutation status, which relied on comparison to a genomic database.

I noticed from some exploratory plots that while the distribution of hypermutated antibodies on PC3 doesn’t appear to have any notable features, multiple modes appear if I subset by cell type. For example, hypermutated Memory B cells appear to have two modes in PC3.


I split the hypermutated Memory B cells at 1 in PC3 space and asked if there were gene expression differences between the groups
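The split-and-compare step described above can be sketched like this. It is a deliberately simplified stand-in: it splits cells at a PC3 threshold of 1 and reports a log2 fold change of mean expression per gene, whereas a real differential expression analysis would use a proper statistical test from a single-cell toolkit. The PC3 values and counts below are made up, and the function names are hypothetical.

```python
from math import log2
from statistics import mean

def split_by_threshold(pc_values, threshold=1.0):
    """Return indices of cells below and at-or-above the threshold."""
    low = [i for i, v in enumerate(pc_values) if v < threshold]
    high = [i for i, v in enumerate(pc_values) if v >= threshold]
    return low, high

def log2_fold_change(counts, group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression in group_b over group_a."""
    mean_a = mean(counts[i] for i in group_a)
    mean_b = mean(counts[i] for i in group_b)
    return log2((mean_b + pseudocount) / (mean_a + pseudocount))

# Made-up PC3 coordinates and expression counts for six cells.
pc3 = [0.2, 0.5, 0.8, 1.4, 1.9, 2.3]
expression = {
    "IGHM": [9, 7, 8, 1, 0, 2],
    "IGHA1": [0, 1, 0, 6, 8, 7],
}

low, high = split_by_threshold(pc3, threshold=1.0)
lfc = {gene: log2_fold_change(counts, low, high) for gene, counts in expression.items()}
for gene, value in lfc.items():
    print(f"{gene}: log2FC (PC3 high vs. low) = {value:.2f}")
```

A negative value means the gene is higher in the PC3-low group, and a positive value means it is higher in the PC3-high group.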


Indeed there are clear differences, with PC3 low having an IGHM memory phenotype and PC3 high having a deeper or class-switched memory phenotype.

Let’s finally take a quick look at the UMAP projection of the antibody sequence encodings


Looks like the UMAP separates antibody sequences by V gene usage in addition to the hypermutation levels.

We can see that sequences using the same V gene generally cluster together, but within these broad gene families the mutated sequences cluster together, apart from the unmutated sequences.

Summary:

I was curious about large language models, so I downloaded one and applied it to data I’ve generated. I found that:

the language model encodes antibody hypermutation on 2 relatively orthogonal axes
one is just whether or not there are any mutations
the other is how many mutations there are
ad-hoc classification of antibody encoding types generally coheres with my orthogonal measurement of gene expression

To do:

Investigate the distance in antibody encoding space between related antibody sequences. Is that distance meaningfully smaller than a reasonable null expectation?
train an LLM on all of antibody space (e.g. mouse, camelid, monkey) vs just human (which AbLang has done). It’s a poorly formed idea, but this could be used to perform comparative immunogenomics in a less-biased manner
train an LLM only on memory antibody space (AbLang has used unmutated and mutated antibodies and I wonder if the large amount of unmutated antibodies may obscure something about hypermutation and antibody evolution)
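One way the first to-do item could be approached is a permutation test: compare the average embedding distance between sequences that share a lineage label against the same statistic after shuffling the labels. The sketch below is only an assumption about how such a test might look. The lineage labels, the toy 2-d "embeddings", and the function names are all hypothetical.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_within_group_distance(embeddings, labels):
    """Average pairwise distance between points that share a label."""
    dists = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if labels[i] == labels[j]:
                dists.append(euclidean(embeddings[i], embeddings[j]))
    return sum(dists) / len(dists)

def permutation_p_value(embeddings, labels, n_permutations=1000, seed=0):
    """One-sided test: are related sequences closer than label-shuffled ones?"""
    rng = random.Random(seed)
    observed = mean_within_group_distance(embeddings, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps group sizes, breaks the real grouping
        if mean_within_group_distance(embeddings, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)

# Toy data: two tight "lineages" far apart from each other.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
]
labels = ["lineage_a"] * 4 + ["lineage_b"] * 4

observed, p_value = permutation_p_value(embeddings, labels)
print(f"mean within-lineage distance = {observed:.3f}, p = {p_value:.3f}")
```

Shuffling labels is only one possible null. A stricter version might shuffle only among sequences with the same V gene, since V gene usage already drives a lot of the structure in the embedding.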

Antibody Background

An antibody is a protective protein produced by the B cells in the immune system to neutralize and defend against harmful pathogens, such as viruses. Structurally, antibodies are composed of four polypeptide chains (two identical heavy chains and two identical light chains) linked by disulfide bonds, forming a Y-shaped molecule. The unique feature of antibodies lies in their variable region, located at the tips of the Y-shaped structure, which is responsible for binding specifically to the foreign substance, or antigen.

The variable region is generated stochastically through a process called V(D)J recombination. This process involves the pseudo-random selection, recombination, and joining of gene segments (Variable (V), Diversity (D), and Joining (J) segments), along with the introduction of pseudo-random nucleotide insertions and deletions between the recombined genes. This generates astronomical standing antibody sequence diversity which could never be encoded in the genome. Selection of particular antibodies from this diversity allows individuals to adapt their immune system to never-before-seen pathogens.

Antibody Evolution

The antibody generation process creates a large pool of standing diversity in protein space. Next, the immune system uses an evolutionary process to select and amplify the antibodies which have the most beneficial properties, such as strongly binding and neutralizing a pathogen which is wreaking havoc in your body. During the selection process, additional diversity is generated by mutating the selected antibodies, allowing the immune system to iterate on creating an even better protein: it selects the best of the selected. This process is called somatic hypermutation, and it’s pretty much fascinating.


A Minimal LaTeX Job Description Template

3 minute read

Published: November 19, 2024

Inspired by both the need to hire someone and this nice aesthetic for job descriptions from b.next, I recently created a minimal LaTeX template for formatting job descriptions. The template hopefully provides a clean, professional, minimalistic layout that someone else could use.

Preview

Job Description Template Preview

You can view the full template here.

Template Structure

The template is organized into several key sections:

Job Title and Basic Information
Position Overview
Key Responsibilities
Required Qualifications
Additional Information (optional)

LaTeX Source

Here’s the minimal LaTeX code to create this template:

\documentclass[11pt,a4paper]{article}
\usepackage[margin=1in]{geometry}
\usepackage{titlesec}
\usepackage{enumitem}
\usepackage{hyperref}
\usepackage[dvipsnames]{xcolor}
\usepackage{tikz}
\usepackage{tikzpagenodes}
\usepackage{graphicx}
\usepackage{transparent}

% Define custom beige color
\definecolor{lightbeige}{RGB}{245, 245, 220}

% Set page color and text color
\pagecolor{lightbeige}
\color{black}

% Define custom section style
\titleformat{\section}
  {\Large\bfseries}
  {}
  {0em}
  {}
  []

% Adjust hyperlink color
\hypersetup{
  colorlinks=true,
  linkcolor=black,
  filecolor=black,
  urlcolor=black,
}

% Define the double frame drawing command
\AddToHook{shipout/background}{%
  \begin{tikzpicture}[remember picture,overlay]
    % Outer rectangle
    \draw[black,line width=0.5pt] 
      ([xshift=0.2in,yshift=-0.2in]current page.north west) 
      rectangle 
      ([xshift=-0.2in,yshift=0.2in]current page.south east);
    
    % Inner rectangle
    \draw[black,line width=0.5pt] 
      ([xshift=0.3in,yshift=-0.3in]current page.north west) 
      rectangle 
      ([xshift=-0.3in,yshift=0.3in]current page.south east);
  \end{tikzpicture}%
}

\begin{document}

% Header with line and logo
\begin{tikzpicture}[remember picture,overlay]
  % Black line
  \draw[black, line width=1pt] 
    (current page text area.north west) -- 
    (current page text area.north east);
  
  % Logo placeholder
  \node[anchor=north east, yshift=45pt] at (current page text area.north east) {
    [COMPANY LOGO]
  };
\end{tikzpicture}

\vspace*{0.5in}

% Job Title (aligned to the left margin)
\noindent\textbf{\Large [JOB TITLE]}

\vspace{0.1em}
\noindent\large [LOCATION]

\vspace{0.2em}
\noindent [WORK TYPE] @ [OFFICE LOCATION]

\vspace{2em}

\noindent [COMPANY NAME] is optimizing the synergistic paradigms of cross-functional deliverables through recursive stakeholder engagement. Our mission is to leverage agile methodologies to disrupt old markets while organizing new ones. \newline \newline
\noindent Our leadership team emerged fully-formed from various merger-acquisition events, including:
\begin{itemize}
  \item A former Chief Innovation Officer who pioneered the LinkedIn engagement bait post where you say something shocking and then heavily qualify it
  \item Three consecutive quarters of record-breaking paradigm shifts as measured by KPIs
  \item Several highly qualified individuals whose roles are described in recursive acronyms
\end{itemize}

\section*{The Role}
As [JOB TITLE], you'll be accountable for driving accountability while ensuring all metrics are properly accounted for. Success in this role requires the ability to simultaneously move both upstream and downstream, while maintaining a lateral trajectory across all verticals. We want you to simultaneously create and destroy frameworks across our technical stack.

\section*{Your Work}
\begin{itemize}
  \item Optimize the efficiency of our optimization protocols
  \item Manage the strategic oversight of strategy implementation
  \item Leverage cross-functional synergies to facilitate greater organization cohesivity
  \item Generate reports about the status of reports about report generation
  \item Maintain KPI performance indicators and other metrics which measure our performance
  \item Facilitate the streamlining of our facilitation processes
  \item Chair the committee on committee efficiency
\end{itemize}

\section*{Who You Are}
\begin{itemize}
  \item Master's degree (or similar experience) in Strategic Strategizing or equivalent years of experience in Factorio
  \item Proven track record of recording tracks and tracking records
  \item Demonstrated ability to circle back on circling back
  \item Expert-level proficiency in taking things offline while remaining online
  \item Strong understanding of understanding strong understandings
  \item Certified in certifying certifications
  \item Results-oriented approach to orienting results toward result orientation
  \item Capable of managing up, down, and sideways simultaneously while remaining perpendicular to objectives
\end{itemize}

\section*{Benefits \& Perks}
\begin{itemize}
  \item Competitive salary of [SALARY RANGE] subject to quarterly readjustment based on the performance of performance metrics
  \item Health insurance (pending approval from the Committee on Insurance-Related Health Matters)
  \item Unlimited PTRPTO (Permission to Request Permission for Time Off)
  \item Access to our proprietary coffee-to-meeting conversion pipeline
  \item Professional development opportunities to develop professional opportunity development
  \item Ping pong table (pending approval from the Recreational Asset Utilization Oversight Board)
\end{itemize}

\section*{Our Investors \& Partners}
\begin{itemize}
  \item Various venture capital entities currently undergoing entity verification
  \item Strategic partnerships with partners
  \item Key angels
\end{itemize}

\vspace{1em}
\begin{center}
\textit{Join us in our mission to [COMPANY MISSION STATEMENT] \\
(This posting has been approved by the Department of Posting Approval Approvals)}
\end{center}

\small
\noindent [COMPANY NAME] is an equal opportunity employer, as certified by our Equal Opportunity Employment Opportunity Certification Committee.

\end{document}


RNA Obelisks

5 minute read

Published: March 11, 2024

novel viroid-like entities revealed by computational biology

These authors would write this paper, and I mean that in the good way. We have Andy Fire, who is famous for a deeply thoughtful career in RNA biology. We have Ami Bhatt: a world-class researcher who leads a group that generates and interprets micro-biome sequencing data. We also have the participation of Robert C. Edgar on the paper — he is, as far as I am aware, the only “independent researcher” in biology that is respected. His open-source software MUSCLES its way into many a phylogenetic analysis. Plus I like his style:

this is the platonic case of a dude who got rich by selling a tech company in the late 90s


Latex Cover Letter Template

less than 1 minute read

Published: November 20, 2023


I’m writing a cover letter for my final paper from my PhD. I couldn’t find a latex template so I made one myself. I hope it can be useful to others.


Dalle and a Combinatoric Collage

5 minute read

Published: May 26, 2023


In this little project, I wanted to try out the Dalle API along with some very simple python functions for combination generation and plotting to create a semi-random collage of Dalle edits of American Gothic. I had a good time using the Dalle API and the davinci text API to put together a remixed collage of American Gothic. While coding this up, I used chatGPT to help with the plotting, which added a meta layer to the experience. Next up, I’ll play with the whisper model from OpenAI.
