
Protein sequencing

Protein sequencing is the process of determining the precise order of amino acids in a protein or peptide chain, which is fundamental to elucidating its three-dimensional structure, biological function, interactions with other molecules, and role in cellular processes. This technique underpins the field of proteomics, enabling the identification of proteins in complex biological samples, the study of post-translational modifications, and applications in diagnostics and other biomedical fields. Unlike approaches that infer a protein's sequence from nucleic acids, protein sequencing reads the primary sequence directly, making it indispensable for validating predicted sequences and analyzing variations that are not encoded in the genome.

The history of protein sequencing began in the early 20th century with initial efforts to analyze amino acid composition through hydrolysis, but the first complete sequence of a protein, insulin, was achieved by Frederick Sanger in the early 1950s using a combination of enzymatic and acid hydrolysis followed by chromatographic separation and identification of fragments. This breakthrough, which demonstrated that proteins have defined rather than random structures, earned Sanger the Nobel Prize in Chemistry in 1958. In 1950, Pehr Edman introduced the Edman degradation method, a chemical process that selectively cleaves and identifies the N-terminal residue of a peptide using phenylisothiocyanate, allowing up to 50-60 residues to be sequenced iteratively with high accuracy. In the 1980s, tandem mass spectrometry (MS/MS) emerged as a complementary tool, initially coupled with Edman sequencing, but it soon surpassed Edman sequencing because it can handle smaller samples and generate sequence information from fragmentation patterns.

Traditional protein sequencing methods, such as Edman degradation, require purified proteins and are limited to linear N-terminal reading, making them labor-intensive and unsuitable for high-throughput analysis of complex mixtures. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) has dominated since the 1990s, serving as the gold standard for proteomics by ionizing peptides, fragmenting them via collision-induced dissociation, and inferring sequences from mass-to-charge ratios, with sensitivities reaching femtomolar levels. However, LC-MS/MS faces challenges including limited dynamic range (typically 10^4 to 10^5), reliance on database matching for identification, and the need for extensive sample preparation, which can introduce biases.

In recent decades, the field has shifted toward next-generation approaches emphasizing single-molecule resolution to overcome these limitations and enable de novo sequencing without prior genomic knowledge. Emerging technologies include fluorosequencing, which adapts Edman degradation with fluorescent labeling for optical detection of amino acids at the single-molecule level, and nanopore-based methods, where proteins are unfolded and translocated through a nanopore to generate electrical signals distinguishing the 20 amino acids based on current blockades or dwell times. As of 2025, advances in AI-driven analysis and related technologies have further improved de novo sequencing accuracy and throughput. These innovations promise portability, lower costs, and the ability to sequence intact proteins or low-abundance species directly from single cells, though they still grapple with issues like uniform translocation control and discrimination accuracy. Ongoing developments in these areas are poised to transform protein sequencing into a routine tool comparable to DNA sequencing.

History and Fundamentals

Historical Development

The early development of protein sequencing began with foundational work on peptide chemistry in the early 20th century. In 1901, Emil Fischer synthesized the first dipeptide, glycylglycine, through partial hydrolysis of glycine diketopiperazine, and extended these experiments to analyze protein composition by hydrolyzing polypeptides into constituent amino acids, establishing the basis for understanding peptide linkages in proteins.

Advances accelerated in the mid-20th century with the development of end-group analysis techniques in the 1940s and 1950s, which allowed identification of terminal amino acids in polypeptide chains. British biochemist Frederick Sanger pioneered this approach using 2,4-dinitrofluorobenzene (DNFB) to label N-terminal residues, enabling their separation and quantification. Building on this, Sanger applied partial acid hydrolysis, together with enzymatic digestion and chromatographic separation of the fragments, to determine the complete sequence of insulin in the early 1950s, revealing its two-chain structure linked by disulfide bonds. This breakthrough demonstrated that proteins possess defined, genetically encoded sequences. For this work, Sanger received the 1958 Nobel Prize in Chemistry.

A pivotal milestone came in 1950 with the introduction of Edman degradation by Pehr Edman, a cyclic chemical method that sequentially removes and identifies N-terminal amino acids from peptides without disrupting the remaining chain, greatly improving sequencing efficiency over partial hydrolysis. In the 1960s, mass spectrometry emerged as a complementary tool for protein sequencing, with early applications by Klaus Biemann in 1966 enabling the analysis of oligopeptides through fragmentation patterns, marking the shift toward instrumental methods.

The 1980s and 1990s saw the transition to automated and high-throughput protein sequencing, driven by refinements in Edman degradation such as gas-phase sequencers introduced in the early 1980s, which minimized sample loss and enabled routine analysis of longer polypeptides. By the 1990s, integration with mass spectrometry further boosted throughput, supporting large-scale proteomic studies and paving the way for genome-protein correlations.

Basic Principles and Importance

Protein primary structure refers to the linear sequence of amino acids in a polypeptide chain, where individual amino acids are covalently linked by peptide bonds between the carboxyl group of one residue and the amino group of the next. This sequence determines the protein's unique identity and serves as the foundation for higher levels of structure, including secondary, tertiary, and quaternary folds that enable biological function. The genetic code, which translates nucleotide sequences in mRNA into protein sequences, specifies 20 standard amino acids using 61 of the 64 possible codons (the remaining three signal termination), with most amino acids encoded by multiple codons to provide redundancy. These amino acids vary in their side chains, conferring diverse chemical properties that influence folding, stability, and interactions.

Determining protein sequences is essential for elucidating protein function, as the primary structure dictates enzymatic activity, binding specificity, and cellular roles. In evolutionary biology, sequence comparisons reveal conservation patterns and divergence, illuminating phylogenetic relationships and adaptive changes. For disease research, sequencing identifies mutations that disrupt function; for instance, a single amino acid substitution (glutamic acid to valine at position 6) in the beta-globin chain causes sickle cell anemia by altering hemoglobin's solubility and leading to red blood cell deformation. In drug development, precise sequence knowledge enables targeted therapies, such as monoclonal antibodies or small molecules that bind specific epitopes. Protein sequencing also underpins proteomics, the large-scale study of proteomes, facilitating biomarker discovery and systems-level insights into cellular processes.

Despite its value, protein sequencing faces challenges due to proteins' inherent heterogeneity, where isoforms arise from alternative splicing or genetic variants, complicating uniform analysis. Post-translational modifications (PTMs), such as phosphorylation or glycosylation, add chemical diversity that can obscure sequences and affect function without altering the encoded amino acid sequence. Additionally, proteins range from tens to thousands of amino acids in length (human titin, for example, comprises over 34,000 residues), posing technical hurdles for complete coverage in long chains.

Protein sequencing approaches are broadly classified as direct (de novo) methods, which experimentally determine the amino acid order without prior genomic data, or indirect methods, which predict sequences from DNA/RNA templates or computational models. Direct methods provide empirical validation, especially for novel or modified proteins, while indirect approaches leverage genomic data for efficiency in well-annotated systems.

Amino Acid Composition Analysis

Hydrolysis Techniques

Hydrolysis techniques are essential for determining the amino acid composition of proteins, as they cleave peptide bonds to release free amino acids for subsequent analysis. These methods must balance complete hydrolysis with minimal degradation or modification of labile residues, though no single approach achieves perfect recovery for all 20 standard amino acids. Acid hydrolysis remains the most widely used due to its efficiency, while alternatives address specific limitations such as tryptophan destruction.

Acid hydrolysis typically employs 6 M hydrochloric acid (HCl) at 110°C for 24 hours in sealed, evacuated tubes to prevent oxidation. This condition achieves near-complete cleavage of peptide bonds for most residues, with recoveries of 86–103% for standard proteins such as bovine serum albumin (BSA). However, it fully destroys tryptophan and partially degrades other labile residues such as serine and threonine, while converting asparagine and glutamine to aspartic and glutamic acids, respectively. To mitigate oxidation of sulfur-containing amino acids, additives such as 0.4% β-mercaptoethanol are included.

Base hydrolysis, using 4–6 M sodium hydroxide (NaOH) or lithium hydroxide (LiOH) at 110–112°C for 16–22 hours, is primarily employed to preserve tryptophan, with recoveries typically of 80-100% under optimized conditions compared to none under acid conditions. It is performed in inert atmospheres or with antioxidants like partially hydrolyzed starch to minimize losses, and NaOH and LiOH give similar tryptophan values. This method, however, risks racemization and degradation of other residues and is less suitable for comprehensive composition analysis due to incomplete hydrolysis of certain bonds.

Enzymatic hydrolysis offers milder conditions using proteases such as trypsin, which cleaves at lysine and arginine residues, or broader enzymes like pronase for near-total breakdown. Conducted at 37–50°C and neutral pH for 24–72 hours, it preserves labile residues such as tryptophan and avoids harsh chemical artifacts, but achieves only partial completeness (e.g., underestimating aspartic and glutamic acids) and is more costly for routine total composition work. It is better suited for generating peptides than free amino acids.

Microwave-assisted hydrolysis accelerates traditional acid methods by applying focused microwave energy to 6 M HCl solutions, reducing processing time to 5–30 minutes at 100–150°C while maintaining good coverage of protein sequences. For instance, it generates up to 1,292 peptides from 2 μg of BSA, enabling faster sample preparation for mass spectrometry-based analysis without significant loss in yield compared to conventional 24-hour incubations.

Common artifacts in these techniques include deamidation, where asparagine and glutamine convert to aspartic and glutamic acids during acid or base hydrolysis, so the measured aspartic and glutamic acid values include the entire asparagine and glutamine content (often reported as Asx and Glx). Racemization, producing D-isomers from L-amino acids (e.g., 1–4% D-Asp formation), occurs via cyclic intermediates under alkaline or prolonged acidic conditions, particularly at residues such as aspartic acid. These modifications necessitate corrections or alternative methods for accurate quantification, often followed by chromatographic separation for residue identification.

Separation and Quantification Methods

Following hydrolysis of proteins into constituent amino acids, separation and quantification methods are essential to determine the molar composition, which serves as a foundational step for inferring sequence information. The classical approach employs ion-exchange chromatography, where amino acids are separated based on their differing affinities for a cation-exchange resin, typically using a gradient of buffers with increasing pH and ionic strength. This method, pioneered by Moore, Stein, and Spackman in 1958, utilizes a single-column, automated system that resolves up to 20 standard amino acids in a single run. Detection occurs post-column via reaction with ninhydrin, producing colored derivatives (purple for most amino acids, yellow for proline) that are quantified spectrophotometrically at 570 nm and 440 nm, respectively. This technique remains a gold standard for its reliability with physiological fluids and protein hydrolysates.

An alternative, widely adopted method is reverse-phase high-performance liquid chromatography (RP-HPLC), which offers faster separation and higher throughput compared to ion-exchange. Amino acids are derivatized pre-column to enhance detectability: phenylisothiocyanate (PITC) forms stable phenylthiocarbamyl (PTC) derivatives detected at 254 nm, as described by Heinrikson and Meredith in 1984. Alternatively, o-phthalaldehyde (OPA) reacts with primary amines to yield fluorescent isoindoles, enabling sensitive fluorescence detection at excitation/emission wavelengths of 340/450 nm, per Jones and Gilligan's 1983 protocol. Separation occurs on a C18 reversed-phase column using an acetonitrile-water gradient, resolving the amino acids in under 30 minutes.

Quantification in both methods relies on peak area integration from chromatograms, calibrated against external standards of known concentrations to generate response factors. This approach achieves a precision of 1-5% relative standard deviation for most amino acids, with internal standards like norleucine correcting for losses or variations.

Modern enhancements include ultra-performance liquid chromatography (UPLC), which employs sub-2 μm particles for superior resolution and reduced analysis time of 10-15 minutes. Coupling with tandem mass spectrometry (LC-MS/MS) provides confirmatory identification via mass-to-charge ratios, improving specificity for isobaric amino acids like leucine and isoleucine. Results are typically reported as molar ratios of each amino acid relative to a reference residue set to 1, facilitating comparison across protein samples and aiding in molecular weight estimation.
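The quantification step described above can be sketched in Python as follows. This is a minimal illustration rather than a validated protocol: the function names, the `Nle` key used for the norleucine internal standard, and all numeric values are hypothetical, and response factors are assumed to be expressed as peak area per picomole.

```python
def quantify(peak_areas, response_factors, internal_std="Nle", spiked_pmol=None):
    """Convert integrated peak areas to picomoles.

    response_factors holds area-per-picomole values from external standards.
    If spiked_pmol is given, amounts are corrected for the recovery of the
    internal standard, and the standard itself is dropped from the result.
    """
    pmol = {aa: area / response_factors[aa] for aa, area in peak_areas.items()}
    if spiked_pmol is not None:
        recovery = pmol[internal_std] / spiked_pmol
        pmol = {aa: v / recovery for aa, v in pmol.items() if aa != internal_std}
    return pmol


def molar_ratios(pmol, reference):
    """Express each amount relative to the reference residue (set to 1)."""
    return {aa: v / pmol[reference] for aa, v in pmol.items()}


# Hypothetical chromatogram values, for illustration only.
areas = {"Ala": 1200.0, "Gly": 2300.0, "Leu": 1800.0, "Nle": 900.0}
factors = {"Ala": 10.0, "Gly": 11.5, "Leu": 9.0, "Nle": 10.0}

amounts = quantify(areas, factors, spiked_pmol=100.0)
ratios = molar_ratios(amounts, reference="Ala")
print({aa: round(r, 2) for aa, r in ratios.items()})
```

The recovery correction rescales every residue by the same factor, so it changes the absolute amounts but not the molar ratios; it matters when the total quantity of protein is also being estimated.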

Terminal Residue Identification

N-Terminal Analysis

N-terminal analysis focuses on identifying the residue at the free α-amino group of a protein or peptide chain, providing key insights into protein identity, purity, and processing events such as post-translational modifications. This technique is particularly valuable in early stages of protein characterization, as the N-terminus often reflects the protein's maturation, including cleavage of signal peptides or leader sequences. Unlike total amino acid composition analysis, which yields overall residue frequencies, N-terminal methods target the specific endpoint residue, enabling confirmation of sequence starts in heterogeneous samples.

The pioneering chemical approach for N-terminal determination was developed by Frederick Sanger in 1945 using 2,4-dinitrofluorobenzene (DNFB), also known as Sanger's reagent. The method involves reacting the intact protein with DNFB under mildly alkaline conditions, where the reagent couples with the unprotonated α-amino group of the N-terminal residue to form a yellow dinitrophenyl (DNP) derivative. Subsequent complete hydrolysis of the labeled protein with acid (e.g., 6 M HCl) breaks all peptide bonds, liberating the DNP-N-terminal amino acid, which remains intact due to its stability under these conditions, while other amino acids are released in free form. The mixture is then separated by two-dimensional paper chromatography, where the DNP-amino acid is identified by its characteristic Rf value and spot color upon comparison with standards. This technique was instrumental in Sanger's elucidation of insulin's structure, identifying phenylalanine as the N-terminal residue of the B-chain and glycine as that of the A-chain, marking a milestone in proving proteins have defined sequences. Limitations include its destructive nature, as it consumes the entire protein sample, and challenges with lysine residues, which also react to form ε-DNP-lysine, complicating identification.

Enzymatic methods offer a milder alternative, utilizing exopeptidases like aminopeptidase M or leucine aminopeptidase to sequentially or selectively release the N-terminal amino acid. These enzymes catalyze the hydrolysis of the peptide bond adjacent to the N-terminus, liberating the free amino acid into solution, which is then quantified and identified via techniques such as reversed-phase high-performance liquid chromatography (HPLC) or post-column derivatization with ninhydrin followed by detection. For instance, controlled incubation with an aminopeptidase can release one or a few residues, allowing stepwise analysis, though specificity varies, and some enzymes prefer hydrophobic residues such as leucine. This approach is advantageous for native proteins and is often used in combination with inhibitors to limit digestion depth. However, it requires active, unblocked N-termini and can be hindered by secondary structure or modifications that sterically impede enzyme access.

Mass spectrometry-based N-terminal analysis has become a cornerstone of modern proteomics due to its sensitivity and ability to handle complex samples. Proteins are typically digested with endoproteases like trypsin to generate peptides, followed by tandem mass spectrometry (MS/MS), where collision-induced dissociation produces fragment ions. The N-terminal sequence is inferred from b-ions, which retain the charge on the N-terminal fragment and exhibit mass differences equal to the residue masses of successive amino acids (e.g., alanine is 14 Da heavier than glycine). Techniques such as electron transfer dissociation (ETD) enhance coverage by generating c-ions, complementary to b-ions, for more robust sequence assignment. This method detects as little as femtomoles of material and can reveal modifications like acetylation by mass shifts (e.g., +42 Da for acetyl). Enrichment strategies, such as negative selection against internal peptides, further isolate N-terminal peptides for targeted analysis.

A related chemical strategy, previewed in Pehr Edman's 1950 method, employs phenylisothiocyanate (PITC) to derivatize the N-terminal amino group; the labeled residue is then cleaved under mild acid conditions and identified by chromatography as a phenylthiohydantoin (PTH) derivative, setting the stage for iterative sequencing without full protein destruction. While N-terminal identification alone confirms endpoints, Edman degradation extends this principle to sequential residue determination.

Applications of N-terminal analysis include verifying recombinant proteins, where the expressed N-terminus must match the predicted sequence after cleavage of tags or signal peptides, ensuring functionality and batch consistency. It is also critical for detecting blocked N-termini, such as N-acetylated residues (common in approximately 80-90% of eukaryotic proteins, particularly in humans) or pyroglutamyl formations, which obscure standard sequencing and signal regulatory roles like stability or localization; mass spectrometry often resolves these by precise mass mapping. It also helps confirm sequence start sites and detect impurities in therapeutic proteins.

C-Terminal Analysis

C-terminal analysis in protein sequencing focuses on identifying the residue at the carboxyl terminus, providing essential information for verifying the directionality of the polypeptide and confirming overall integrity. Unlike N-terminal methods, which target the amino group, C-terminal approaches exploit the reactivity of the carboxyl group to release or label the terminal residue sequentially. Early techniques emphasized enzymatic and chemical degradation, while contemporary methods integrate mass spectrometry for enhanced precision and throughput.

The primary enzymatic approach involves carboxypeptidases, which are exopeptidases that sequentially hydrolyze peptide bonds from the C-terminus, releasing free amino acids that can be quantified over time to deduce the sequence. Carboxypeptidase A (CPA), derived from bovine pancreas, preferentially cleaves non-basic, non-proline residues such as aromatic and aliphatic amino acids, making it suitable for initial C-terminal identification in many proteins. Carboxypeptidase B (CPB) complements CPA by specifically targeting basic residues like lysine and arginine at the C-terminus, allowing for a combined enzymatic strategy to handle diverse terminal sequences. For broader applicability, carboxypeptidase Y (CPY) from yeast is widely used due to its broad substrate specificity, cleaving nearly all C-terminal residues including proline, though it is often employed for limited sequencing of 5-10 residues to avoid incomplete reactions.

A classical chemical method for C-terminal analysis is hydrazinolysis, developed by Shiro Akabori in the early 1950s. In this procedure, the protein is treated with anhydrous hydrazine at elevated temperatures (around 100°C for several hours). Hydrazine cleaves the peptide bonds and converts every residue except the C-terminal one into an amino acid hydrazide; the C-terminal residue, whose carboxyl group is not part of a peptide bond, is released as the free amino acid. This free amino acid is then separated from the hydrazides and identified by chromatography, often after derivatization with dinitrophenyl (DNP) reagents, enabling unambiguous assignment. This method, applied early on to proteins like insulin, marked a significant advance in the 1950s for confirming C-terminal residues without enzymatic biases.

In modern workflows, mass spectrometry enhances C-terminal analysis, particularly through ladder sequencing coupled with carboxypeptidase digestion. Time- or concentration-dependent digestion with CPY generates a series of truncated peptides, which are analyzed by matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS); the mass differences between peaks correspond to specific amino acid residues, revealing the C-terminal sequence. In tandem MS (MS/MS), fragmentation of peptides produces y-ions (characteristic fragments retaining the C-terminus), whose masses allow direct inference of the terminal sequence from the low-mass end of the spectrum.

Despite these advances, C-terminal sequencing faces challenges, including slow or incomplete digestion by carboxypeptidases at certain residues, which can hinder sequential release and lead to ambiguous results. Hydrazinolysis, while specific, risks partial degradation of sensitive residues such as serine, and requires anhydrous conditions to minimize side reactions. These limitations often necessitate orthogonal methods for verification, particularly in complex proteomes.
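The ladder-sequencing idea can be illustrated with a short Python sketch that converts the mass differences between successive truncated peptides into residue assignments. The residue masses are standard monoisotopic values; the function name, the 0.02 Da tolerance, and the peak list are assumptions made for illustration, and real MALDI data would usually call for a wider tolerance and for handling of missing ladder steps.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peak_masses, tol=0.02):
    """Return the residues released from the C-terminus, in order of release.

    peak_masses are the masses of the intact peptide and its successively
    truncated forms. Each gap between neighbouring peaks is matched against
    the residue table; isobaric candidates are joined with "/", and gaps
    that match nothing are reported as "?".
    """
    peaks = sorted(peak_masses, reverse=True)
    released = []
    for heavier, lighter in zip(peaks, peaks[1:]):
        delta = heavier - lighter
        matches = sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)
        released.append("/".join(matches) if matches else "?")
    return released


# Hypothetical ladder: an intact peptide and three truncation products.
peaks = [1000.0, 871.90504, 758.82098, 611.75257]
released = read_ladder(peaks)
print(released)
# Reversing the release order gives the C-terminal sequence read N-to-C.
print("-".join(reversed(released)))
```

Leucine and isoleucine cannot be told apart by mass, which is why the sketch reports them as a joint candidate instead of choosing one.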

Edman Degradation

Peptide Fragmentation

Peptide fragmentation is a critical step in protein sequencing, where intact proteins are cleaved into smaller peptides to facilitate subsequent analysis by methods such as Edman degradation or mass spectrometry. This process generates manageable fragments typically 5–50 amino acids long, allowing for the determination of partial sequences that can be assembled into the full protein sequence. Cleavage is achieved through either enzymatic or chemical means, each offering specific advantages in terms of site selectivity and conditions.

Enzymatic digestion employs proteases with defined specificity to hydrolyze peptide bonds under mild aqueous conditions, preserving the integrity of amino acid side chains. Trypsin, a serine protease, cleaves exclusively at the C-terminal side of lysine (Lys) and arginine (Arg) residues, except when followed by proline, producing peptides with basic C-termini that are amenable to further purification. Chymotrypsin preferentially cleaves after large hydrophobic residues such as phenylalanine (Phe), tyrosine (Tyr), and tryptophan (Trp), though it can also act on leucine (Leu) and methionine (Met) at lower rates, generating aromatic-containing peptides useful for mapping hydrophobic regions. Endoproteinase Glu-C (also known as V8 protease) cleaves on the C-terminal side of glutamic acid (Glu) residues, with activity extending to aspartic acid (Asp) under certain buffer conditions (e.g., phosphate buffer at pH 7.8), enabling the production of acidic peptides for complementary coverage. Endoproteinase Asp-N, a metalloprotease, cleaves on the N-terminal side of aspartic acid (Asp) residues, and to a lesser extent glutamic acid (Glu), producing peptides with Asp at the N-terminus that aid in resolving regions resistant to other cleavages.

Chemical cleavage methods provide alternatives when enzymatic approaches are insufficient, often targeting less frequent residues for broader fragment spacing. Cyanogen bromide (CNBr) reacts with the sulfur of methionine (Met) residues to cleave at the C-terminal side, converting Met to homoserine lactone and yielding peptides suitable for N-terminal sequencing; this method is particularly effective for proteins with few Met residues, as demonstrated in early structural studies of cytochromes.

To reconstruct the complete protein sequence from fragmented peptides, an overlap strategy is employed, involving multiple parallel digests with different enzymes or chemicals to generate sets of peptides that share overlapping sequences. These overlaps allow alignment and assembly, as pioneered in the sequencing of insulin where tryptic and chymotryptic fragments were compared to order the chain. Following digestion, peptides are often separated by gel-based electrophoresis to isolate individual components prior to sequencing; sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) resolves peptides by molecular weight under denaturing conditions, while two-dimensional (2D) electrophoresis combines isoelectric focusing with SDS-PAGE for enhanced resolution of complex mixtures.

Optimization of fragmentation yield is essential, particularly for proteins with disulfide bonds that can hinder protease access. Reduction of cystine (Cys-Cys) bridges using agents like dithiothreitol (DTT), followed by alkylation of free thiols with iodoacetamide (IAA), unfolds the protein and prevents re-formation of disulfides, ensuring complete digestion and higher sequence coverage in downstream Edman or mass spectrometry workflows.
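The overlap strategy can be sketched as a greedy merge of sequenced fragments from two digests. The code below is a toy illustration, not the procedure used historically: the 15-residue sequence and both fragment sets are hypothetical, and the greedy rule (always merge the pair with the longest suffix-prefix overlap) can mis-assemble real data, where short or repeated overlaps are ambiguous.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedily merge fragments by their longest suffix-prefix overlaps."""
    # Drop duplicates and fragments fully contained in a longer fragment.
    frags = sorted(f for f in set(fragments)
                   if not any(f != g and f in g for g in fragments))
    while len(frags) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a in frags for b in frags if a != b),
                      key=lambda t: t[0])
        if k == 0:
            break  # no overlaps left; the assembly stays in several pieces
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])
    return frags


# Hypothetical fragments of the toy sequence AGKFLDRWTEKYSVR.
tryptic = ["AGK", "FLDR", "WTEK", "YSVR"]        # cleaved after K and R
chymotryptic = ["AGKF", "LDRW", "TEKY", "SVR"]   # cleaved after F, W, and Y

assembled = assemble(tryptic + chymotryptic)
print(assembled)
```

Several of the joins in this toy case rest on a single shared residue, which would not be considered reliable evidence in practice; a third digest with a different specificity is the usual way to obtain longer overlaps.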

Chemical Reaction Mechanism

The Edman degradation proceeds through a cyclic series of chemical reactions that selectively label, cleave, and identify the N-terminal amino acid of a peptide, enabling sequential sequencing without disrupting the remaining chain.

In the initial coupling step, phenylisothiocyanate (PITC) is reacted with the free α-amino group of the N-terminal residue under mildly basic conditions (pH 8–9), typically in a buffered aqueous solution. The nucleophilic nitrogen of the amine attacks the electrophilic central carbon of the isothiocyanate group, forming a phenylthiocarbamoyl (PTC) derivative, a substituted thiourea, by nucleophilic addition; no leaving group is expelled in this step. The reaction is highly selective for the unprotonated primary amine, minimizing side reactions with other nucleophilic groups in the peptide.

Following coupling, the PTC-peptide undergoes cleavage in the presence of anhydrous trifluoroacetic acid (TFA) at room temperature for approximately 10–30 minutes. Under these acidic conditions, the sulfur atom of the PTC group attacks the carbonyl carbon of the first peptide bond, which leads to cyclization and formation of a five-membered thiazolinone ring. This cyclization cleaves the scissile peptide bond adjacent to the N-terminal residue, releasing the thiazolinone derivative while leaving the shortened peptide intact and ready for the next cycle. The reaction is essentially quantitative under anhydrous conditions, ensuring minimal hydrolysis of internal peptide bonds.

The unstable thiazolinone is then converted to the stable phenylthiohydantoin (PTH) derivative through acid-catalyzed rearrangement, often by brief treatment with aqueous TFA or heating in an acidic medium. This involves ring opening and recyclization, incorporating the side chain of the original amino acid into a thiohydantoin heterocycle that is soluble in organic solvents and amenable to chromatographic identification. The PTH-amino acid is extracted into an organic phase (e.g., ethyl acetate) and analyzed, typically by reverse-phase HPLC, by comparison of retention times with PTH standards derived from known amino acids.

The overall process per cycle can be represented as:

$$
\text{Peptide-NH}_2 + \text{PITC} \rightarrow \text{PTC-Peptide} \xrightarrow{\text{TFA}} \text{Thiazolinone} + \text{Peptide(-1)-NH}_2
$$

$$
\text{Thiazolinone} \xrightarrow{\text{aq. acid}} \text{PTH-AA}
$$

This mechanism ensures specificity, with one residue released per cycle at 95–99% efficiency, allowing reliable sequencing of up to 50–60 residues before cumulative losses become prohibitive.
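The effect of cumulative losses follows from simple compounding: if each cycle succeeds with the same repetitive yield, the fraction of chains still in phase after n cycles is that yield raised to the power n. The sketch below assumes a constant per-cycle yield, which is an idealization; the function names and the 30% signal threshold are illustrative choices.

```python
import math


def cumulative_yield(repetitive_yield, cycles):
    """Fraction of peptide chains still in phase after the given cycles."""
    return repetitive_yield ** cycles


def max_cycles(repetitive_yield, min_fraction):
    """Largest cycle count whose cumulative yield stays above min_fraction."""
    return math.floor(math.log(min_fraction) / math.log(repetitive_yield))


for r in (0.95, 0.98, 0.99):
    print(f"repetitive yield {r:.2f}: "
          f"{cumulative_yield(r, 50):.3f} remaining after 50 cycles, "
          f"{max_cycles(r, 0.30)} cycles above a 30% threshold")
```

At a 95% repetitive yield less than a tenth of the material is still in phase after 50 cycles, whereas at 99% well over half remains, which is why small gains in per-cycle efficiency translate into substantially longer reads.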

Automated Sequencing Instrumentation

Automated protein sequencers for Edman degradation revolutionized the field by enabling high-throughput, reproducible sequencing of peptides and proteins with minimal manual intervention. Early designs, such as the spinning cup sequenator developed by Edman and Begg in 1967, featured a rotating cup where the peptide sample was applied to the inner wall, often coated with polybrene (a quaternary ammonium polymer) to immobilize the peptide and prevent losses during sequential solvent extractions and washes. This liquid-phase system automated the delivery of reagents and collection of fractions, allowing for the processing of up to 50 cycles with initial yields from 10-100 nmol samples, though it suffered from cumulative losses due to the solubility of peptides in organic solvents.

Advancements in the 1980s addressed these limitations through gas-phase sequencers, notably the Applied Biosystems model 470A sequencer introduced in 1982, which delivered coupling and cleavage reagents (phenylisothiocyanate and trifluoroacetic acid) as vapors to a chamber containing the immobilized peptide. This design minimized sample losses, enabling sequencing from as little as 1-10 pmol of sample while maintaining high cycle efficiency. Subsequent models, such as the 477A released in the mid-1980s, further improved performance by incorporating polybrene-coated filters or discs for sample application, which enhanced adsorption during the gas-phase cycles, and allowed for multiple reaction cartridges to support continuous, unattended operation. These supports, often treated with polybrene to promote electrostatic binding, were particularly effective for handling sub-picomole quantities electroblotted from gels.

Detection in these automated systems relies on high-performance liquid chromatography (HPLC) to separate and quantify the phenylthiohydantoin (PTH) derivatives released each cycle, with UV absorbance at 269 nm as the standard detection method for identification against known standards; fluorescence detection has also been integrated in later variants for increased sensitivity down to femtomole levels. Throughput typically supports 30-60 cycles per run, often completed overnight with cycle times of 45-60 minutes, yielding sequences of 20-50 residues depending on sample purity and size.

Proprietary software in models like the 477A automates data analysis by matching HPLC peak retention times and areas to a library of PTH-amino acid standards for residue assignment, while calculating key metrics such as initial yield (from the first cycle) and repetitive yield (typically 95-99% per cycle, reflecting the efficiency of successive degradations). For example, a repetitive yield of ~98% allows reliable sequencing over multiple cycles before signal loss becomes prohibitive. These tools also flag carryover artifacts, ensuring accurate interpretation, and can be combined with upstream fragmentation workflows to extend sequence coverage.
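One common way to estimate these two metrics from a run is a log-linear fit of PTH recovery against cycle number, since a constant repetitive yield makes the recovered amount decay geometrically. The sketch below is a generic least-squares illustration, not the algorithm of any particular instrument's software; the function name, the convention of reporting initial yield as the fitted amount at cycle 1 divided by the amount loaded, and the synthetic data are all assumptions.

```python
import math


def fit_yields(cycles, pmol_recovered, pmol_loaded):
    """Estimate (initial_yield, repetitive_yield) from PTH recoveries.

    Fits log(pmol) = intercept + slope * cycle by ordinary least squares.
    The repetitive yield is exp(slope); the initial yield is the fitted
    recovery at cycle 1 divided by the amount of peptide loaded.
    """
    logs = [math.log(p) for p in pmol_recovered]
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(logs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, logs))
             / sum((x - mean_x) ** 2 for x in cycles))
    intercept = mean_y - slope * mean_x
    repetitive = math.exp(slope)
    initial = math.exp(intercept + slope * 1) / pmol_loaded
    return initial, repetitive


# Synthetic, noise-free data: 10 pmol loaded, 5 pmol recovered at cycle 1,
# and a 95% repetitive yield thereafter.
cycles = [1, 5, 10, 15]
recovered = [5.0 * 0.95 ** (c - 1) for c in cycles]
initial, repetitive = fit_yields(cycles, recovered, pmol_loaded=10.0)
print(f"initial yield {initial:.2f}, repetitive yield {repetitive:.3f}")
```

In practice the fit is usually restricted to stable residues that recur in the sequence, because PTH derivatives of residues such as serine and threonine are recovered poorly and would bias the slope.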

Mass Spectrometry-Based Sequencing

Proteolytic Digestion Strategies

Proteolytic digestion is a cornerstone of the bottom-up approach in mass spectrometry-based protein sequencing, where complex protein mixtures are enzymatically or chemically cleaved into smaller peptides to enhance ionization efficiency and facilitate liquid chromatography-mass spectrometry (LC-MS) analysis. This strategy generates peptides typically 5-50 amino acids long, which are more amenable to tandem mass spectrometry (MS/MS) fragmentation than intact proteins.

Trypsin is the most commonly used enzyme for digestion due to its high specificity, cleaving peptide bonds C-terminal to lysine and arginine residues under neutral pH conditions, producing peptides with basic C-termini that ionize well in positive-ion mode LC-MS. In-gel digestion involves excising protein bands from sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE), destaining, reducing disulfide bonds, and incubating with trypsin overnight at 37°C, which is particularly useful for separating complex samples prior to analysis. In-solution digestion, performed directly on solubilized proteins, offers higher throughput and is often conducted in urea or guanidine hydrochloride buffers with trypsin-to-protein ratios of 1:20 to 1:50 (w/w) for 4-18 hours at 37°C, improving compatibility with LC-MS workflows.

To achieve greater sequence coverage and reduce missed cleavages, multi-enzyme combinations such as Lys-C followed by trypsin are employed; Lys-C specifically cleaves at lysine residues, generating longer peptides that are subsequently refined by trypsin's dual specificity, often increasing protein identifications by 10-20% in complex samples. This sequential digestion reduces missed cleavages, particularly at lysine residues that trypsin alone cleaves less efficiently. Chemical adjuncts like cyanogen bromide (CNBr) provide orthogonal cleavage at methionine residues, complementing enzymatic methods for proteins with low lysine/arginine content or to target specific regions; CNBr reacts in acidic conditions (e.g., 70% formic acid) to form homoserine lactone, yielding peptides suitable for MS when enzymatic coverage is insufficient.

Sample preparation prior to digestion includes denaturation with chaotropes like 8 M urea to unfold proteins, reduction of disulfide bonds using dithiothreitol (DTT) or tris(2-carboxyethyl)phosphine (TCEP), and alkylation of free cysteines with iodoacetamide (IAA) at 15-50 mM in the dark to prevent reformation of disulfide bridges, ensuring complete accessibility for proteases. Post-digestion cleanup employs C18 tips or spin columns to desalt peptides and remove detergents or salts, concentrating samples in 0.1% formic acid for optimal LC-MS loading.

These strategies aim for >70% sequence coverage through overlapping peptides from multiple cleavages, enabling robust assembly of full protein sequences via MS/MS data. The resulting peptide mixtures are then separated by liquid chromatography and sequenced by tandem mass spectrometry for comprehensive protein identification.
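The trypsin cleavage rule and the notion of sequence coverage can be made concrete with an in silico digest. The sketch below applies the commonly used simplified rule (cleave after K or R unless the next residue is P); the function names, the length limits, and the example sequence are hypothetical, and real digests also show sequence-context effects that this rule ignores. It relies on `re.split` accepting zero-width matches, which requires Python 3.7 or later.

```python
import re


def tryptic_peptides(sequence, missed_cleavages=0, min_len=1, max_len=50):
    """In silico tryptic digest: cleave after K/R except when followed by P."""
    # Zero-width split points: preceded by K or R, not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = []
    for i in range(len(pieces)):
        for m in range(missed_cleavages + 1):
            if i + m < len(pieces):
                pep = "".join(pieces[i:i + m + 1])
                if min_len <= len(pep) <= max_len:
                    peptides.append(pep)
    return peptides


def coverage(sequence, identified):
    """Fraction of residues covered by at least one identified peptide."""
    covered = [False] * len(sequence)
    for pep in identified:
        start = sequence.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = sequence.find(pep, start + 1)
    return sum(covered) / len(sequence)


protein = "AGKFLDRPWTEKYSVR"  # hypothetical sequence containing an R-P bond
print(tryptic_peptides(protein))
print(tryptic_peptides(protein, missed_cleavages=1))
print(coverage(protein, ["FLDRPWTEK", "YSVR"]))
```

Allowing one or two missed cleavages when generating candidate peptides is a common search setting, because incomplete digestion is routine in practice.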

De Novo Peptide Sequencing

De novo peptide sequencing determines the amino acid sequence of peptides directly from tandem mass spectrometry (MS/MS) data, independent of reference databases, making it essential for discovering novel proteins or sequences in non-model organisms. This approach typically follows proteolytic digestion of proteins into peptides, which are then separated, ionized, and selected for fragmentation in the second stage of MS to produce diagnostic fragment ions whose mass-to-charge ratios reveal the sequence. In tandem MS, fragmentation is achieved via methods such as collision-induced dissociation (CID), higher-energy collisional dissociation (HCD), or electron transfer dissociation (ETD), each generating distinct ion series for sequence inference. CID and HCD primarily cleave the backbone at amide bonds to yield b-ions (N-terminal fragments retaining the charge) and y-ions (C-terminal fragments), with b-ions often showing further neutral losses like water or ammonia. ETD, in contrast, transfers electrons to multiply charged peptides, producing c-ions (N-terminal) and z-ions (C-terminal) while preserving labile post-translational modifications. The complementary nature of these ions, where the sum of a singly charged b-ion and its corresponding y-ion equals the singly protonated peptide mass plus one proton (about 1 Da), enables bidirectional sequencing validation. Sequence reconstruction relies on identifying "ladders" of consecutive fragment ions, where mass differences between adjacent peaks match known amino acid residue masses, such as +57.0215 Da for glycine or +71.0371 Da for alanine. For example, a b-ion series with mass increments of 71 Da and 113 Da would indicate alanine followed by leucine or isoleucine. These ladders are assembled by aligning observed peaks to theoretical ion positions, accounting for common neutral losses (e.g., -18 Da for H₂O from serine or threonine). High mass accuracy (e.g., <5 ppm) is crucial for resolving near-isobaric differences, though challenges arise with truly isobaric residues like leucine and isoleucine (both 113.0841 Da), which require orthogonal methods like MS³ or NMR for distinction.
Computational tools automate this process using de novo algorithms that model spectra as graphs, with nodes as possible prefix masses and edges weighted by amino acid probabilities based on ion intensities and cleavage preferences. PEAKS employs a scoring scheme to evaluate fragment ion matches and generate the optimal sequence with confidence tags for variable regions, outperforming earlier tools like Lutefisk on Q-TOF data from tryptic digests. Novor, a real-time alternative, uses dynamic programming for initial ladder building followed by machine learning refinement with decision trees trained on spectral features, achieving 7-37% more correct residues than PEAKS across diverse datasets. A more recent advancement is Casanovo (2024), which reframes de novo sequencing as a sequence-to-sequence translation problem using a transformer model on raw spectral data, achieving an average precision of 0.95 on benchmark datasets, outperforming Novor and other tools by up to 25-37% in peptide identification. Accuracy at the amino acid level reaches 80-95% for short peptides (5-15 residues), where spectra exhibit clearer ion series, but drops for longer sequences due to incomplete fragmentation or spectral noise; HCD often boosts this to ~95% by enhancing higher-energy fragments. Isobaric ambiguities, particularly Leu/Ile, limit full-sequence fidelity to 30-55% without additional resolution. As an illustrative example, consider the MS/MS spectrum of the tryptic peptide SGNFSFQTVK ([M+2H]^{2+} at m/z 557.8). The singly charged y-ion ladder includes peaks at m/z 147.1 (y₁: K, 128.095 Da residue plus water and a proton), 246.2 (y₂: VK, +99.068 Da for V), 347.2 (y₃: TVK, +101.048 Da for T), and higher ions up to y₉ at m/z 1027.5, with successive differences matching Gln (128.059 Da), Phe (147.068 Da), Ser (87.032 Da), etc., to read the C-terminus as ...QTVK; complementary b-ions (e.g., m/z 145.1 for b₂: SG, +57.021 Da for G after S) confirm the N-terminus SGNFSF, yielding the full sequence.
Neutral losses like -17 Da (NH₃ from K) or -18 Da (H₂O) annotate side peaks, aiding ladder extension.
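
The ladder arithmetic for SGNFSFQTVK can be checked with a short calculation. The sketch below assumes standard monoisotopic residue masses and singly charged fragments; the names `fragment_ions` and `precursor_mz` are illustrative helpers, not part of any sequencing package.

```python
# Monoisotopic residue masses in Da (standard values; L and I are identical).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z lists (b1..b(n-1), y1..y(n-1))."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:              # N-terminal prefixes
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):     # C-terminal suffixes
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def precursor_mz(peptide, charge):
    """m/z of the intact peptide carrying `charge` protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


if __name__ == "__main__":
    b, y = fragment_ions("SGNFSFQTVK")
    print("precursor 2+:", round(precursor_mz("SGNFSFQTVK", 2), 2))
    print("y1..y3:", [round(m, 2) for m in y[:3]])
    print("b2:", round(b[1], 2))
```

Differences between consecutive entries of either list return the residue masses, which is the property a ladder-reading algorithm exploits; a b-ion and its complementary y-ion always sum to the neutral peptide mass plus two protons.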

Terminal Residue Determination

In mass spectrometry (MS)-based protein sequencing, terminal residue determination focuses on identifying the N- and C-terminal amino acids of peptides or intact proteins through characteristic fragmentation patterns observed in tandem MS (MS/MS) spectra. These modern MS approaches complement classical chemical methods by enabling high-throughput analysis in complex mixtures. For N-terminal identification, immonium ions—low-mass fragments derived from single amino acid residues—serve as diagnostic markers in collision-induced dissociation (CID) MS/MS, providing residue-specific signals that confirm the N-terminal sequence. Additionally, the a-ion series, which are N-terminal fragments resulting from cleavage of the peptide backbone and loss of carbon monoxide from b-ions, further supports N-terminal assignment by forming a ladder of ions that map the sequence from the amino end. C-terminal residues are primarily identified via the y-ion series in MS/MS, where these C-terminal fragments arise from amide bond cleavages and retain the carboxyl terminus, allowing sequential readout of the peptide's end. Exocyclic fragmentation patterns, observed in certain peptide structures, can enhance C-terminal detection by producing side-chain-involved ions that highlight the terminal residue without internal backbone disruption. Isobaric labeling techniques, such as isobaric tags for relative and absolute quantitation (iTRAQ) and tandem mass tags (TMT), improve terminal signal detection by covalently modifying primary amines at the N-terminus (and lysine side chains), which boosts ionization efficiency and identification rates of N-terminal peptides in bottom-up workflows. These labels yield higher peptide-spectrum matches compared to unlabeled samples, facilitating reliable terminal residue confirmation in quantitative proteomics. 
In top-down MS, electron transfer dissociation (ETD) preserves terminal sequences by generating c- and z-ion series with minimal loss of labile modifications, enabling intact protein analysis up to 80 kDa and providing extensive N- and C-terminal coverage. ETD's non-ergodic fragmentation mechanism ensures that terminal product ions remain intact, supporting precise endpoint sequencing in proteoform characterization. This MS-centric terminal determination is particularly useful for confirming splice variants, where variant-specific termini arise from alternative exon usage, and for verifying post-translational processing events like proteolytic cleavage that alter protein ends. For instance, top-down MS with ETD has been applied to distinguish periostin splice isoforms at the protein level through terminal mass differences. Similarly, it aids in detecting N- or C-terminal proteoforms involved in diverse protein complexes or processing pathways.

Post-Translational Modification Analysis

Post-translational modifications (PTMs) significantly impact protein function and are integral to mass spectrometry (MS)-based protein sequencing, where detection and precise localization within peptide sequences are essential for accurate structural elucidation. In bottom-up proteomics workflows, PTMs are identified by mass shifts in peptide spectra following enzymatic digestion, enabling site-specific mapping when combined with appropriate fragmentation techniques. Common PTMs analyzed in MS include phosphorylation, which introduces a mass shift of approximately +80 Da due to the addition of a phosphate group (HPO₃), glycosylation, which exhibits variable mass increases depending on glycan composition (e.g., +203 Da for an N-acetylhexosamine residue in N-linked forms), and ubiquitination, detected via a +114 Da Gly-Gly remnant on lysine residues after tryptic cleavage of the ubiquitin chain. These modifications are enriched prior to MS analysis to enhance detection sensitivity, as they often occur at low stoichiometry. Localization of PTMs relies on fragmentation methods that preserve modification integrity, such as electron transfer dissociation (ETD) and electron capture dissociation (ECD), which generate intact PTM-peptide fragments (c- and z-type ions) for unambiguous site assignment, particularly for labile modifications like phosphorylation and O-GlcNAc glycosylation. In contrast, collision-induced dissociation (CID) is less suitable for these, as it frequently results in neutral losses (e.g., 98 Da H₃PO₄ from phosphopeptides), complicating localization. Software tools like MaxQuant and Proteome Discoverer facilitate site-specific PTM assignment by calculating localization probabilities based on fragment ion matching and mass accuracy, integrating ETD/ECD data to score potential modification sites with high confidence (e.g., >0.75 probability threshold).
These platforms process raw MS spectra to output PTM-localized peptides, supporting integration with de novo sequencing for ambiguous cases. Challenges in PTM analysis include the loss of labile modifications during CID fragmentation and neutral losses that reduce signal intensity, often requiring hybrid fragmentation approaches or enrichment strategies to achieve reliable detection. For quantitative assessment of PTM dynamics, methods such as stable isotope labeling by amino acids in cell culture (SILAC) compare modification abundances across conditions, revealing regulatory changes (e.g., stoichiometry shifts in signaling pathways).
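
Detection by mass shift can be illustrated with a simple lookup. This is a deliberately reduced sketch: it assumes the monoisotopic shifts for the three modifications named in this section, a single modification per peptide, and a hypothetical helper name `annotate_shift`; real search engines consider many more modifications and score fragment-level evidence for site localization.

```python
# Monoisotopic mass shifts in Da for the modifications discussed above.
PTM_SHIFTS = {
    "phosphorylation (HPO3)": 79.96633,
    "HexNAc": 203.07937,
    "Gly-Gly remnant": 114.04293,
}


def annotate_shift(observed_mass, unmodified_mass, tol_ppm=10.0):
    """Return the PTM names whose shift explains the observed mass difference.

    Both masses are neutral monoisotopic peptide masses in Da; the tolerance
    is expressed in ppm of the observed mass.
    """
    delta = observed_mass - unmodified_mass
    tolerance = observed_mass * tol_ppm / 1e6
    return [name for name, shift in PTM_SHIFTS.items()
            if abs(delta - shift) <= tolerance]


if __name__ == "__main__":
    unmodified = 1113.54547          # neutral mass of SGNFSFQTVK
    print(annotate_shift(unmodified + 79.96633, unmodified))
    print(annotate_shift(unmodified, unmodified))
```

A precursor-level match like this only says that a modification is present; assigning it to a specific residue still depends on fragment ions that bracket the site.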

Whole-Protein Mass Measurement

Whole-protein mass measurement, a cornerstone of top-down mass spectrometry (MS) in protein sequencing, involves the precise determination of the molecular weight of intact proteins to infer sequence-related features without prior enzymatic digestion. This approach enables the characterization of proteoforms, that is, variants arising from alternative splicing, genetic mutations, or post-translational modifications (PTMs), by directly ionizing and analyzing the full-length protein. Unlike peptide-centric methods, top-down MS preserves the connectivity of modifications across the entire sequence, providing insights into protein heterogeneity that are critical for understanding biological function and disease mechanisms. Ionization of intact proteins is typically achieved using electrospray ionization (ESI) or matrix-assisted laser desorption/ionization (MALDI). ESI, which generates multiply charged ions, is the preferred method for most top-down experiments due to its ability to produce soft ionization of proteins up to approximately 100 kDa, facilitating the transfer of noncovalent complexes into the gas phase while minimizing fragmentation during ionization. MALDI, while effective for larger proteins exceeding 100 kDa, often yields singly charged ions that require high-resolution analyzers to resolve isotopic patterns accurately. These techniques allow for the initial intact mass measurement, serving as the foundation for subsequent sequencing efforts by establishing the baseline molecular weight against which modifications can be mapped. High mass accuracy is essential for distinguishing subtle mass shifts from PTMs or isoforms, with Fourier transform ion cyclotron resonance (FT-ICR) and Orbitrap mass analyzers achieving mass accuracies below 1 ppm. FT-ICR provides parts-per-billion accuracy and ultra-high resolving power, enabling unambiguous assignment of elemental compositions for proteins up to 200 kDa, while Orbitrap systems deliver sub-1 ppm precision in a more compact format suitable for routine laboratory use.
In top-down MS/MS, these analyzers facilitate fragmentation of intact ions using methods like electron capture dissociation (ECD) or collision-activated dissociation (CAD), generating sequence tags (short stretches of contiguous fragment ions) that confirm the protein identity and localization of modifications without full de novo sequencing. ECD in particular yields complementary c- and z-type ions that retain labile PTMs, enhancing the reliability of structural annotations. Applications of whole-protein mass measurement include distinguishing protein isoforms with near-identical sequences but differing masses, such as those from sequence variants, and quantifying PTM stoichiometry to assess functional regulation. For instance, top-down MS has resolved histone isoforms differing by single amino acid substitutions, revealing their roles in epigenetic control, and determined the occupancy of phosphorylations in signaling proteins to quantify activation states. These capabilities are particularly valuable in clinical proteomics, where proteoform profiling aids in biomarker discovery for diseases like cancer. However, limitations persist, including inefficient fragmentation for proteins larger than 50-70 kDa, where charge state distributions become complex and yield fewer informative sequence tags, often necessitating hybrid approaches or advanced instrumentation to overcome signal suppression and related challenges.
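
Because ESI places many protons on one protein, the intact mass has to be recovered from a series of m/z peaks. The sketch below uses the standard textbook relation for two adjacent charge states, which is not spelled out in this article; the function name `deconvolute_pair` and the example values are assumptions for illustration. If a peak at m/z m₁ carries z protons and its lower-m/z neighbour at m₂ carries z + 1, then z = (m₂ − mₚ) / (m₁ − m₂) and M = z (m₁ − mₚ), where mₚ is the proton mass.

```python
PROTON = 1.007276  # Da


def deconvolute_pair(mz_high, mz_low):
    """Infer charge and neutral mass from two adjacent ESI charge-state peaks.

    mz_high carries z protons; mz_low (the neighbouring peak at lower m/z)
    carries z + 1 protons.
    """
    if mz_high <= mz_low:
        raise ValueError("mz_high must be greater than mz_low")
    charge = round((mz_low - PROTON) / (mz_high - mz_low))
    neutral_mass = charge * (mz_high - PROTON)
    return charge, neutral_mass


if __name__ == "__main__":
    # Hypothetical peaks for a protein of about 16,951 Da at charges 10+ and 11+.
    z, mass = deconvolute_pair(1696.107276, 1542.007276)
    print(z, round(mass, 1))
```

Production deconvolution software fits the whole charge-state envelope and the isotope pattern rather than a single pair, which is considerably more robust to noise.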

Method Limitations and Challenges

Mass spectrometry-based protein sequencing, while powerful, faces several inherent limitations that can impact its reliability and applicability, particularly in complex biological samples. One primary challenge is sensitivity, as typical mass spectrometers require at least femtomole (10^{-15} mol) quantities of protein for reliable detection, corresponding to roughly 6 × 10^8 molecules, or about 50 pg of a 50-kDa protein. In complex mixtures, ion suppression further exacerbates this issue, where co-eluting compounds compete for ionization, reducing signal intensity for low-abundance peptides and leading to under-detection of rare proteoforms. Sequence coverage often remains incomplete, with gaps particularly pronounced in hydrophobic regions such as transmembrane domains of integral membrane proteins. These regions resist proteolytic digestion and exhibit poor solubility, resulting in low recovery during extraction and analysis, sometimes achieving less than 50% coverage in bottom-up workflows. This incomplete coverage hinders full proteoform characterization, especially for multi-spanning membrane proteins critical in cellular signaling. Ambiguities in spectrum interpretation pose another significant hurdle, notably with isobaric residues like leucine and isoleucine, which share identical masses (113.084 Da) and cannot be distinguished solely by mass-to-charge ratios, leading to potential misassignments in sequencing. Post-translational modifications (PTMs) introduce additional interferences by altering masses and fragmentation patterns, complicating site localization in up to 50% of modified peptides without targeted enrichment. De novo sequencing is computationally demanding and time-intensive compared to database-matching approaches, often requiring extensive processing resources and achieving lower throughput, with costs escalating for large-scale analyses due to the need for high-resolution instruments and software.
Database-dependent methods, while faster and more cost-effective, rely on comprehensive reference databases, limiting utility for novel or non-model organisms. To mitigate these challenges, hybrid strategies integrating Edman degradation with mass spectrometry have been developed, where Edman degradation provides precise N-terminal sequence validation for short peptides, complementing MS coverage in low-sensitivity scenarios. Such combinations enhance overall accuracy, particularly for resolving ambiguities in therapeutic proteins like monoclonal antibodies.

Sequence Prediction from Nucleic Acids

Translation from DNA Sequences

Translating protein sequences from DNA involves interpreting the genomic sequence through the genetic code, which maps nucleotide triplets (codons) to amino acids. In prokaryotes, this process is relatively straightforward, as genes are often continuous coding sequences without interruptions. However, in eukaryotes, the primary DNA transcript (pre-mRNA) undergoes processing, including the removal of non-coding introns to form mature mRNA, which is then translated into protein. RNA serves as the intermediate for this translation, carrying the genetic information from DNA to the ribosome where amino acids are assembled. This DNA-to-protein inference is fundamental in genomics for predicting proteomes from sequenced genomes. A key step in deducing protein sequences is identifying open reading frames (ORFs), which are stretches of DNA beginning with a start codon (typically ATG, encoding methionine) and ending with a stop codon (TAA, TAG, or TGA), uninterrupted by other stop codons in the reading frame. ORFs are scanned computationally to locate potential protein-coding regions, with tools like NCBI's ORFfinder searching user-input DNA sequences and providing the range and translated protein for each identified ORF. Accurate ORF annotation is crucial for understanding how genetic information translates to functional proteins, as evidenced by studies revealing thousands of novel translated ORFs in human genomes. In eukaryotic genomes, splicing complicates direct translation from DNA, as introns—non-coding sequences interspersed within genes—are precisely excised by the spliceosome, joining exons to form the coding mRNA. This process enables alternative splicing, where different exon combinations produce multiple protein isoforms from a single gene, vastly increasing proteomic diversity; for instance, over 95% of human multi-exon genes undergo alternative splicing. 
Seminal work has shown that splicing regulation involves cis-acting elements and trans-factors, allowing tissue-specific isoform expression. Failure to account for splicing variants can lead to incomplete or erroneous protein sequence predictions from genomic DNA. The genetic code exhibits degeneracy, meaning most amino acids are encoded by multiple codons (up to six for some, like leucine), primarily differing in the third position, a pattern explained by the wobble hypothesis. Proposed by Francis Crick in 1966, this hypothesis explains that the third base in a codon-anticodon pairing allows non-standard base pairing (wobble), enabling a single tRNA to recognize multiple synonymous codons and reducing the need for 61 unique tRNAs. This redundancy minimizes the impact of certain mutations but means that a protein sequence cannot be uniquely back-translated to its DNA sequence. Computational tools facilitate genome-to-protein translation by generating predictions across all possible reading frames. The six-frame translation method translates DNA in three forward frames (starting at positions 1, 2, or 3) and three reverse frames (from the complementary strand), helping identify ORFs without prior knowledge of gene orientation; this is implemented in tools like Transeq, which outputs sequences for all six frames. Such approaches are integral to genome annotation pipelines, as seen in comparative analyses where six-frame translations aid in detecting coding regions across species assemblies. Despite these advances, translating from DNA has limitations, as it cannot capture post-translational modifications (PTMs) like phosphorylation or glycosylation, which alter protein function but occur after translation and are not encoded in the DNA sequence. Additionally, non-coding regions, including regulatory elements and alternative ORFs, influence protein expression and diversity but are not directly reflected in primary sequence predictions. Alternative splicing further generates isoforms whose sequences deviate from the genomic template, necessitating RNA-level data for full accuracy.
These gaps highlight why DNA-based predictions often require experimental validation through direct protein sequencing methods such as mass spectrometry.
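
Six-frame translation and ORF scanning are simple enough to sketch directly. The example below assumes the standard genetic code and plain A/C/G/T input; the function names are illustrative and the ORF finder uses a simplified rule (from the first ATG-encoded methionine to the next stop), so it is not a substitute for ORFfinder or Transeq.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_orfs(protein, min_len=1):
    """Return M...stop stretches (without the stop) from a translated frame."""
    return [m.group(1) for m in re.finditer(r"(M[^*]*)\*", protein)
            if len(m.group(1)) >= min_len]


if __name__ == "__main__":
    for name, protein in six_frame("ATGGCCAAATAA").items():
        print(name, protein, find_orfs(protein))
```

In practice a minimum ORF length (often tens of codons) is applied to suppress the many short ORFs that arise by chance, and eukaryotic genes additionally require splice-aware gene models.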

Inference from RNA Sequencing

RNA sequencing (RNA-Seq) enables the inference of protein sequences by first capturing the transcriptome through reverse transcription of RNA into complementary DNA (cDNA), fragmentation, and high-throughput next-generation sequencing (NGS) to generate short reads that represent expressed transcripts. These reads are typically mapped to a reference genome or assembled de novo to reconstruct mRNA sequences, with poly-A tail selection during library preparation enriching for mature mRNAs and facilitating identification of 3' untranslated regions (UTRs) during alignment. Once assembled, open reading frames (ORFs) within the mRNA sequences are identified by scanning for start (AUG) and stop codons, followed by in silico translation into amino acid sequences using the standard genetic code, which yields the predicted polypeptide chains of expressed genes. To detect sequence variants that alter protein sequences, RNA-Seq reads are analyzed for single nucleotide polymorphisms (SNPs) in coding regions, where nonsynonymous SNPs can lead to amino acid substitutions; for example, variant callers such as GATK process aligned reads to identify heterozygous or homozygous variants with high confidence when coverage exceeds 20x depth. Gene fusions, which produce chimeric proteins, are inferred by detecting discordant read pairs or split reads spanning fusion junctions in the transcriptome, often using algorithms such as STAR-Fusion that filter for biologically plausible events based on genomic proximity and expression levels. Ribosome profiling (Ribo-Seq), an extension of RNA-Seq, enhances variant detection by sequencing ribosome-protected mRNA fragments, revealing translationally active ORFs including those with SNPs or fusions that affect protein synthesis; this method isolates ~30-nucleotide footprints from translating ribosomes, allowing precise mapping of variant impacts on the proteome.
A key advantage of inferring proteins from RNA-Seq is its ability to capture alternatively spliced transcripts and low-abundance isoforms that may not be evident in genomic data, as the method directly profiles mature mRNAs and quantifies isoform-specific expression through differential splicing analysis tools like rMATS. This approach reveals dynamic proteomes under specific conditions, such as tissue-specific splicing events that generate protein diversity. However, challenges include RNA instability, which necessitates rapid extraction and RNase-free protocols to prevent degradation and ensure representative sampling of fragile transcripts like those with short half-lives. Alignment errors arise from splicing junctions and sequence polymorphisms, potentially leading to misassembled transcripts if short reads fail to span complex exons; mitigation involves splice-aware aligners like HISAT2, though error rates can still exceed 5% in repetitive regions.
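
The effect of a coding SNP on the predicted protein can be worked out codon by codon. The following sketch assumes a cDNA-style coding sequence (T rather than U), the standard genetic code, and a 0-based position within the coding sequence; `classify_snp` is a hypothetical helper and does not reflect the interface of GATK or any other variant caller.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(cds, pos, alt):
    """Classify a single-base substitution in a coding sequence.

    cds: coding sequence starting at the first base of the start codon.
    pos: 0-based index of the substituted base within cds.
    alt: the alternative base.
    Returns (kind, label), e.g. ("missense", "K3E").
    """
    cds, alt = cds.upper(), alt.upper()
    codon_start = pos - pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return kind, f"{ref_aa}{codon_start // 3 + 1}{alt_aa}"


if __name__ == "__main__":
    cds = "ATGGCCAAATAA"                 # encodes M-A-K-stop
    print(classify_snp(cds, 5, "T"))     # third base of the alanine codon
    print(classify_snp(cds, 6, "G"))     # first base of the lysine codon
```

Only missense and nonsense changes alter the predicted protein, which is why variant-aware protein databases built from RNA-Seq typically include just those classes.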

Emerging Sequencing Technologies

Nanopore-Based Methods

Nanopore-based methods for protein sequencing rely on the translocation of unfolded proteins or peptides through a nanoscale pore embedded in a lipid or synthetic membrane that separates two electrolyte-filled compartments. An applied voltage drives the charged polypeptide chain through the pore, where it partially blocks the flow of ions, producing characteristic changes in electrical current that depend on each amino acid's size, charge, and hydrophobicity. This single-molecule approach allows for label-free detection without the need for chemical labeling, enabling analysis of native protein sequences. Oxford Nanopore Technologies has adapted its nucleic acid sequencing platform for protein analysis by engineering biological nanopores, such as modified protein channels, to accommodate peptide bonds and generate amino acid-specific signals. In their peptide-based detection workflow, proteins are first digested into shorter peptides, which are then attached to DNA handles and translocated using a motor protein, producing current blockades that reflect the peptide's composition for identification and barcoding. Engineered pores like the glycine-to-phenylalanine substituted Fragaceatoxin C (G13F-FraC-T1) enhance peptide capture and resolution, allowing differentiation of peptides with a resolution of approximately 40 Da and detection of small chemical modifications at low pH to minimize electroosmotic flow. These adaptations enable protein identification through protease-generated peptide spectra, akin to mass fingerprinting but at the single-molecule level. As of 2025, significant advances include the integration of motor proteins, such as helicase-like enzymes, to control the rate of protein unfolding and feeding into the pore, improving signal resolution for longer chains. Oxford Nanopore's roadmap outlines progress toward full-length protein sequencing without digestion, with prototypes demonstrating detection in complex samples and proteoform analysis.
Accuracy for individual amino acid identification reaches up to 98.8% in optimized setups, while short peptide chains achieve around 90% sequencing fidelity, though challenges persist in distinguishing structurally similar amino acids. These methods complement mass spectrometry by offering direct, single-molecule insights into protein dynamics. The primary advantages of nanopore-based protein sequencing are its label-free nature, real-time readout, and potential for portable, high-throughput applications in biomarker discovery and variant screening. However, key challenges include efficient protein unfolding, especially for folded domains, and achieving sufficient current blockade distinction for all 20 amino acids in longer sequences, limiting current use to short peptides or digested samples.

Single-Molecule Recognition Techniques

Single-molecule recognition techniques represent an emerging class of methods for protein sequencing that rely on detecting unique electronic or optical signatures of individual amino acids (AAs) without enzymatic fragmentation, enabling potential sequencing of intact polypeptides. These approaches leverage nanoscale devices to probe AAs at the single-molecule level, offering high sensitivity for distinguishing subtle structural differences, including post-translational modifications (PTMs).

Fluorosequencing

Fluorosequencing is an optical single-molecule method that combines Edman degradation chemistry with fluorescent labeling and high-throughput imaging to determine peptide sequences. Specific amino acids, such as cysteine, lysine, and others, are labeled with distinct fluorophores that are stable and brightly emissive. Cyclic rounds of Edman degradation selectively remove the N-terminal residue, releasing a fluorescent tag whose color and position are imaged using total internal reflection fluorescence (TIRF) microscopy on a chip with millions of immobilized peptides. This enables parallel sequencing of thousands to millions of peptides per run, with read lengths up to 50 residues. Developed initially at the University of Texas and advanced by companies like Erisyon, fluorosequencing supports sequencing without genomic references and detects PTMs through altered fluorescence or cleavage patterns. As of 2025, improvements include brighter, more photostable dyes and increased automation, achieving single-molecule sensitivity down to attomolar concentrations and integration with probabilistic models for abundance inference in complex proteomes. Advances also encompass expanded labeling of up to 10 amino acid types, enhancing coverage for proteins with sparse target residues. Key advantages include high parallelism comparable to next-generation sequencing, direct PTM localization, and compatibility with low-input samples like single cells. Challenges involve incomplete labeling of all 20 amino acids (typically 4-6 types per run), potential residue-specific biases in cleavage efficiency, and the need for sophisticated image analysis to handle tag release. These limitations restrict full coverage but position fluorosequencing as a complementary tool for targeted proteomics.

Recognition Tunneling

Recognition tunneling (RT) employs a pair of nanoelectrodes separated by a narrow gap (approximately 1-2 nm) functionalized with recognition molecules, such as imidazole-carboxamide, that selectively bind to specific amino acids. As a polypeptide translocates through the gap, often via controlled pulling, the transverse quantum tunneling current modulates in a characteristic manner for each AA due to its unique electronic orbital overlap with the recognition layer. This signature is measured and decoded using machine learning algorithms to identify the AA with over 95% accuracy for many residues. Early demonstrations distinguished all 20 proteinogenic AAs, including isobaric isomers like leucine and isoleucine, and detected PTMs such as phosphorylation on serine. The technique avoids mass spectrometry's need for ionization and fragmentation, preserving sequence context in native-like conditions. However, practical implementation requires stable nanojunction fabrication and mitigation of noise from aqueous environments or contaminants. Recent prototypes have explored graphene-based electrodes to enhance gap uniformity and signal-to-noise ratios, achieving discrimination of structurally similar AAs, including serine.

DNA-PAINT Variants

Variants of DNA points accumulation for imaging in nanoscale topography (DNA-PAINT) adapt super-resolution fluorescence microscopy for protein sequencing by using fluorophore-labeled DNA probes that transiently bind to specific AA tags or epitopes on immobilized polypeptides. In quantitative PAINT (qPAINT), the binding kinetics of probes to tagged residues provide a readout proportional to AA abundance, enabling compositional "fingerprinting" for protein identification. Discrete molecular imaging (DMI), a DNA-PAINT extension, unfolds proteins on a surface and images probe binding sites with sub-5 nm resolution, allowing partial sequence inference from spatially resolved tag patterns. These methods excel in multiplexing, with orthogonal DNA sequences enabling simultaneous detection of multiple AA types, and have identified over 50% of the human proteome via labeling alone. Sample preparation involves chemical tagging and surface tethering, which can introduce biases but supports high-throughput analysis on substrates. As of 2025, advances in RT include gold nanojunctions that distinguish all 20 AAs plus phosphorylated variants through enhanced tunneling signals, with narrower nanojunction gaps reported to provide sixfold higher sensitivity than larger designs. Integration efforts with atomic force microscopy (AFM) for precise translocation control have improved resolution, while optimized setups report sequencing speeds approaching 10 AAs per second in controlled prototypes. DNA-PAINT variants have seen speed enhancements via left-handed DNA sequences, enabling 10-plex imaging for faster tag readout. Key advantages of these techniques include label-free (for RT) or minimally invasive optical detection, enabling intact protein analysis without digestion, and potential for PTM localization. Limitations encompass the need for surface immobilization, which risks conformational artifacts, and challenges in achieving long-read sequences beyond 50-100 AAs due to junction instability or probe off-rates.
Compared to parallel approaches like nanopore methods, single-molecule recognition emphasizes static binding signatures over dynamic translocation currents.

Bioinformatics Tools

Sequence Alignment and Assembly

In protein sequencing, fragmented data from techniques such as tandem mass spectrometry (MS/MS) often requires computational assembly to reconstruct full-length sequences, particularly when dealing with unknown or novel proteins. Sequence alignment and assembly processes integrate overlapping fragments, derived from enzymatic or partial cleavage, into contiguous sequences, while alignment tools compare these to known homologs for validation and extension. These methods are essential for de novo sequencing scenarios where no reference database is available, enabling the elucidation of protein primary structures from raw spectral data. De novo assembly of protein sequences typically employs an overlap-layout-consensus (OLC) paradigm, where short peptide ladders or sequence tags are first aligned based on sequence overlaps, then laid out into a scaffold, and finally refined via consensus to resolve ambiguities. This approach is particularly suited to peptide fragments, as it handles variable-length overlaps from enzymatic cleavages like trypsin digestion. For instance, some software facilitates reconstructive searches by incorporating sequenced tags with tolerated errors, allowing assembly of partial peptides into longer contigs through iterative overlap matching. Similarly, other tools apply OLC to short peptide sequences (6-100 amino acids), using hash tables and prefix trees to efficiently identify and merge overlaps, achieving contigs with high identity to reference proteins. Once assembled, sequences undergo alignment to detect homologies and refine boundaries. BLASTP (Basic Local Alignment Search Tool for proteins) performs local alignments by identifying high-scoring segment pairs between query peptides and database entries, using a substitution matrix like BLOSUM62 to score matches while penalizing gaps and mismatches.
For multiple sequence alignment, Clustal Omega employs progressive alignment strategies with guide trees derived from pairwise distances, enabling robust homology matching across related protein fragments and revealing conserved regions that inform assembly decisions. These tools enhance accuracy by cross-validating de novo contigs against evolutionary relatives. Error correction is integrated throughout assembly and alignment to mitigate inaccuracies from noisy MS data or degradation inefficiencies. In MS-based workflows, peptide-spectrum match (PSM) scores and de novo sequencing confidence metrics, such as those from spectral intensity correlations, are weighted to prioritize reliable overlaps, with algorithms discarding low-scoring fragments or applying voting schemes across redundant peptides. For Edman degradation-derived sequences, repetitive yield measurements (typically 95-99% per cycle) provide quantitative error estimates, allowing correction by adjusting for cumulative signal loss in ladder alignments. Hybrid methods combine these, using MS scores to validate Edman-derived residues and vice versa, reducing false positives in contig formation. For short-read data, graph-based models like de Bruijn graphs offer an alternative to OLC by representing peptides as k-mers (overlapping subsequences of length k) and resolving Eulerian paths through the graph to reconstruct sequences. Weighted variants incorporate edge scores from spectral intensities or alignment probabilities, as seen in antibody sequencing pipelines where de Bruijn graphs assemble variable regions from fragmented peptides with improved contiguity over traditional methods. This is computationally efficient for high-throughput data but requires careful k-mer selection to balance coverage and error propagation.
The output of these processes is typically formatted as FASTA files, in which assembled sequences appear alongside headers containing metadata such as confidence scores (e.g., average PSM or yield-adjusted probabilities) and assembly statistics such as contig length and overlap coverage. These scores, often derived from log-likelihood ratios or posterior probabilities, enable downstream assessment of reliability, with thresholds above 95% commonly used for biomedical applications.
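As a rough illustration of the path from overlapping fragments to a FASTA record, the following Python sketch merges peptides greedily by their longest suffix-prefix overlap and writes the result with metadata in the header. It is a simplified stand-in for OLC (there is no layout or consensus step), and the function names, the header fields `length` and `mean_confidence`, and the confidence value are all hypothetical.

```python
def overlap_length(left, right, min_overlap):
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(a, b, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no remaining pair overlaps by at least min_overlap
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


def to_fasta(name, sequence, mean_confidence, width=60):
    """Format one record; the header fields are illustrative, not a standard."""
    header = f">{name} length={len(sequence)} mean_confidence={mean_confidence:.2f}"
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body)


# Three overlapping fragments of the insulin A chain.
fragments = ["GIVEQCCTSI", "CCTSICSLYQ", "SLYQLENYCN"]
contigs = greedy_assemble(fragments)
record = to_fasta("contig_1", contigs[0], mean_confidence=0.97)
print(record)
```

Greedy merging can misassemble repeats, which is one reason production tools add layout and consensus stages and weight overlaps by score.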

Database Resources and Software

UniProt serves as a central repository for curated protein sequences and functional annotations, compiling data from various sources including nucleotide sequence translations and direct protein sequencing efforts. UniProtKB provides access to over 199 million protein sequences, with detailed annotations on function, interactions, and modifications, and lets researchers retrieve sequences in formats such as FASTA for downstream analysis.

The Protein Data Bank (PDB) maintains a collection of experimentally determined three-dimensional protein structures, with associated primary sequences that link structural data to sequence information for over 200,000 entries as of 2025. These structure-linked sequences facilitate studies of protein interactions and are often integrated with tools for sequence visualization and alignment.

RefSeq, curated by the National Center for Biotechnology Information (NCBI), offers a non-redundant set of reference protein sequences derived from genomic annotations, transcripts, and proteins, totaling millions of entries that provide stable identifiers for genome annotation and comparative genomics. These sequences are generated through automated and manual curation to ensure consistency and biological relevance.

For identifying proteins from mass spectrometry (MS) data, search tools such as Mascot and SEQUEST perform peptide-spectrum matching against these databases. Mascot, a widely used engine, compares experimental MS/MS spectra to theoretical fragments from sequence databases, scoring matches on mass accuracy and fragmentation patterns to achieve high-confidence identifications. Similarly, SEQUEST uses cross-correlation to align observed tandem spectra with predicted peptides, enabling the identification of proteins in complex mixtures, including low-abundance species.

Prediction software complements experimental sequencing by inferring structural and modification features from primary sequences.
AlphaFold, developed by DeepMind, uses deep learning to predict three-dimensional protein structures directly from sequences, achieving near-experimental accuracy for many targets and reshaping structural biology since its 2021 release. SignalP predicts signal peptides (N-terminal sequences that direct protein localization and are often cleaved during post-translational processing) using models trained on diverse eukaryotic and prokaryotic proteins, with versions up to 6.0 handling metagenomic data.

Proteomics pipelines such as MaxQuant integrate these databases and tools for end-to-end analysis, processing raw MS data through identification via embedded search engines such as Andromeda, followed by quantification and annotation against sequence databases for validation. This workflow supports label-free and labeled experiments, linking spectral matches to database entries for comprehensive protein profiling.

As of 2025, database resources have expanded to incorporate data from next-generation protein sequencing (NGPS) technologies, enhancing coverage of proteoforms and post-translational modifications in repositories such as dbPTM. These updates, including over 2.79 million PTM sites in dbPTM, reflect the growing integration of high-throughput sequencing outputs for improved sequence diversity and functional insights.
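The peptide-spectrum matching step performed by the search engines above can be sketched in a few lines of Python. This is a toy shared-peak count, not Mascot's probability-based scoring or SEQUEST's cross-correlation; the function names, the 0.02 m/z tolerance, and the candidate peptides are assumptions for illustration. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mz(peptide):
    """Return singly charged b-ion and y-ion m/z lists for a peptide."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for mass in masses[:-1]:  # b ions keep the N-terminus
        running += mass
        b_ions.append(running + PROTON)
    running = 0.0
    for mass in reversed(masses[1:]):  # y ions keep the C-terminus
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(peptide, observed_mz, tol=0.02):
    """Count theoretical fragments that have an observed peak within `tol`."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(theo - obs) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )


def best_candidate(candidates, observed_mz, tol=0.02):
    """Pick the database peptide that explains the most observed peaks."""
    return max(candidates, key=lambda pep: count_matches(pep, observed_mz, tol))


# Simulate a noise-free spectrum of PEPTIDE and search two candidates.
b_ions, y_ions = fragment_mz("PEPTIDE")
observed = b_ions + y_ions
print(best_candidate(["EDITPEP", "PEPTIDE"], observed))
```

Real engines additionally model peak intensities, multiple charge states, modifications, and the chance of random matches, which is why their scores are reported as probabilities or correlation values rather than raw counts.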

Applications

Biomedical and Proteomic Uses

Protein sequencing plays a pivotal role in drug development by enabling the precise characterization of therapeutic proteins such as monoclonal antibodies (mAbs) and enzymes. De novo sequencing by mass spectrometry (MS) allows antibody sequences to be determined directly from purified products, facilitating the identification of sequence variants and optimization for therapeutic efficacy. For instance, this approach has been applied to sequence intact IgG antibodies, revealing subtle modifications that affect binding affinity and stability in development pipelines. Similarly, sequencing enzymes aids in engineering variants with enhanced catalytic properties for targeted therapies.

In disease proteomics, MS-based protein sequencing has transformed the identification of cancer biomarkers, particularly in phosphoproteomics, which maps phosphorylation sites to uncover dysregulated signaling pathways. High-resolution MS techniques enable the profiling of thousands of phosphosites in tumor samples, highlighting alterations in kinases and downstream effectors that serve as diagnostic or prognostic markers. For example, phosphoproteomic analyses of cancer cell lines and tissues have identified novel therapeutic targets by revealing hyperphosphorylated proteins associated with tumor progression.

Protein sequencing also supports personalized immunotherapy by characterizing human leukocyte antigen (HLA) peptidomes, which are critical for vaccine design. MS-based immunopeptidomics sequences HLA-presented peptides on tumor cells, identifying neoantigens that can be targeted by T-cell therapies or vaccines tailored to individual HLA alleles. This has advanced precision medicine, where sequencing the HLA ligandome helps predict immune responses and select patients for checkpoint inhibitors or adoptive therapies.

Comparative protein sequencing contributes to evolutionary studies by enabling phylogenetic analysis through alignment of sequences across species, reconstructing evolutionary relationships and functional divergences.
MS-driven sequencing of ancient or low-abundance proteins, combined with bioinformatics, has resolved phylogenetic trees for protein families.

High-throughput next-generation protein sequencing (NGPS) technologies are emerging as key tools in single-cell proteomics, allowing proteomes to be analyzed at cellular resolution to study heterogeneity in diseases such as cancer. As of 2025, platforms using single-molecule fluorescence detection sequence thousands of peptides per cell, enabling the mapping of protein expression variations that inform studies of tumor microenvironments and therapeutic resistance. These advances, supported by bioinformatics tools, promise deeper insights into cellular dynamics.

Cryptographic and Novel Applications

Protein sequences have been proposed as robust keys for encryption because of their high variability and complexity, using the 20 standard amino acids to generate unique codes that are hard to crack. In the 2000s, early proposals explored amino acid variability in the context of biological computing, extending DNA cryptography principles to protein-level encoding by mapping codons to amino acids for secure data transformation. More recent work, such as proteinoid assemblies, uses the self-organizing electrical properties of amino acids to create bio-inspired encryption schemes intended to resist traditional brute-force attacks. A 2025 introduction to protein cryptography formalizes encoding data directly into amino acid sequences, enabling storage and transmission within synthetic proteins that can be sequenced for decryption.

In synthetic biology, protein sequencing provides essential feedback for the iterative design of novel proteins with tailored functions, allowing researchers to verify structural outcomes and refine sequences through cycles of synthesis and analysis. De novo protein design tools now integrate sequencing data to create unprecedented folds and assemblies, such as modular scaffolds for enzymatic applications, enhancing the precision of protein engineering beyond natural evolution. This feedback loop has enabled the development of customizable proteins for diverse biotechnological uses, including biosensors and therapeutic agents.

Protein sequencing supports forensic applications by identifying species through variant peptide markers and by profiling individuals through proteomic signatures from degraded samples where DNA is unavailable. For instance, mass spectrometry-based sequencing detects single amino acid polymorphisms in body fluids or tissues, enabling ethnic group determination and post-mortem interval estimation. In species identification, metaproteomic analysis distinguishes microbial or animal origins in trace evidence, aiding wildlife crime investigations.

Environmental applications of protein sequencing include metaproteomics, which profiles functional proteins in microbiomes to assess microbial dynamics without relying on genomic inference. In soil and water samples, sequencing reveals active enzymatic pathways in microbial communities, informing environmental management and monitoring. Recent ultra-sensitive metaproteomic methods have extended this to low-biomass environments, quantifying protein functions in complex consortia for climate impact studies.
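The notion of encoding data directly into amino acid sequences, described at the start of this section, can be made concrete with a small Python sketch. It is an assumption-laden illustration rather than the scheme from any cited proposal: it simply rewrites a byte string in base 20 over the one-letter codes of the standard amino acids, and it provides encoding only, with no encryption or error correction.

```python
# One-letter codes of the 20 standard amino acids, used as base-20 digits.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode(data: bytes) -> str:
    """Map bytes to an amino acid string by base-20 conversion.

    A leading 0x01 byte is prepended so that leading zero bytes survive
    the round trip (a choice made for this sketch).
    """
    number = int.from_bytes(b"\x01" + data, "big")
    residues = []
    while number:
        number, digit = divmod(number, 20)
        residues.append(AMINO_ACIDS[digit])
    return "".join(reversed(residues))


def decode(sequence: str) -> bytes:
    """Invert `encode`: read the residues as base-20 digits and drop the marker byte."""
    number = 0
    for residue in sequence:
        number = number * 20 + AMINO_ACIDS.index(residue)
    raw = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return raw[1:]


message = b"\x00key"
sequence = encode(message)
print(sequence, decode(sequence))
```

A practical design would also have to account for synthesis constraints, sequencing errors, and residues that are hard to distinguish by mass (such as leucine and isoleucine), none of which this sketch addresses.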

References

  1. [1]
    Paving the way to single-molecule protein sequencing - Nature
    Sep 6, 2018 · Proteins are major building blocks of life. The protein content of a cell and an organism provides key information for the understanding of ...
  2. [2]
    Protein Sequencing, One Molecule at a Time - PMC - NIH
    Historically, Edman sequencers were applied only to purified (homogenous) proteins, so the detached amino acid derivatives could then be identified by bulk ...
  3. [3]
    1 Emerging Protein Sequencing Technologies: Proteomics without ...
    Highlights. • Liquid chromatography-tandem mass spectrometry has been the leading technology for proteomics for nearly three decades.
  4. [4]
    Emil Fischer – Biographical - NobelPrize.org
    In 1901 he discovered, in collaboration with Fourneau, the synthesis of the dipeptide, glycyl-glycine and in that year he also published his work on the ...Missing: composition | Show results with:composition
  5. [5]
    [PDF] One Hundred Years of Peptide Chemistry
    In 1901, Emil Fischer (with E Fourneau) published an article which reports the preparation of the first dipeptide, glycylglycine, obtained ~y partial ...
  6. [6]
    Sequencing proteins: Insulin - What is Biotechnology
    Sanger described the process like piecing together a jig-saw. His technique would later be called the degradation or DNP method. The novelty of Sanger's ...
  7. [7]
    [PDF] Frederick Sanger - The chemistry of insulin - Nobel Prize
    In the original work on insulin, silica-gel chromatography was used, though more recently other systems, particularly paper chromatography, have been found more ...Missing: 1950s | Show results with:1950s
  8. [8]
    On 'A method for the determination of amino acid sequence in ... - NIH
    Edman degradation, the first method to determine the amino acid sequence of a peptide, was published in 1949 in the Archives of Biochemistry.
  9. [9]
    HUPO Proteomics Timeline
    Its first use in protein sequencing was in 1966 when Biemann and his colleagues successfully sequenced several oligopeptides containing glycine, alanine, serine ...
  10. [10]
    History and Trends in Protein Sequencing | MtoZ Biolabs
    Edman Degradation Era (1950s-1980s). The chemical sequencing method introduced by Pehr Edman, which sequentially cleaves the N-terminal amino acids and ...
  11. [11]
    Edman Method (Protein Sequencer) | [Analytical Chemistry]Products
    In the 1980s, the first Japanese gas-phase protein sequencer PSQ-1 was developed to automatically perform Edman degradation in the gas phase and analyze amino ...Missing: history | Show results with:history<|separator|>
  12. [12]
    Biochemistry, Primary Protein Structure - StatPearls - NCBI Bookshelf
    Oct 31, 2022 · To reiterate, the primary structure of a protein is defined as the sequence of amino acids linked together to form a polypeptide chain.
  13. [13]
    Protein Structure | Learn Science at Scitable - Nature
    The linear sequence of amino acids within a protein is considered the primary structure of the protein. Proteins are built from a set of only twenty amino ...<|control11|><|separator|>
  14. [14]
    Genetic Code - National Human Genome Research Institute
    in various ways to spell out three-letter “ ...
  15. [15]
    Biochemistry, Essential Amino Acids - StatPearls - NCBI Bookshelf
    Apr 30, 2024 · Among these 20 amino acids, 9 are essential—phenylalanine, valine, tryptophan, threonine, isoleucine, methionine, histidine, leucine, and lysine ...Introduction · Fundamentals · Molecular Level · Mechanism
  16. [16]
  17. [17]
    Learning the protein language: Evolution, structure, and function
    Jun 16, 2021 · Understanding the sequence-structure-function relationship is the central problem of protein biology and is pivotal for understanding disease ...Introduction · Protein Language Models... · Star Methods
  18. [18]
    Gene Mutations in Human Hæmoglobin: the Chemical Difference ...
    Gene Mutations in Human Hæmoglobin: the Chemical Difference Between Normal and Sickle Cell Hæmoglobin. V. M. INGRAM. Nature volume 180, pages 326–328 (1957)Cite ...
  19. [19]
    Protein Sequencing: Significance, Methods, and Applications
    Protein sequencing is instrumental in drug discovery and development. By elucidating the primary structure of target proteins, researchers can design molecules, ...
  20. [20]
    Advances in protein sequencing: Techniques, challenges and ...
    Advancement in high-throughput protein sequencing techniques. •. Sequencing ... 1990s with the advances in mass spectrometry-based proteomics [9]. In ...
  21. [21]
    Overview of Post-Translational Modifications (PTMs)
    Technically, the main challenges to studying post-translationally modified proteins are the development of specific detection and purification methods.
  22. [22]
    TTN - Titin - Homo sapiens (Human) | UniProtKB | UniProt
    Apr 18, 2012 · Homo sapiens (Human). Amino acids. 34350 (go to sequence). Protein existence. Evidence at protein level. Annotation score. 5/5. Entry · Variant ...
  23. [23]
    De Novo Protein Sequencing vs DNA Sequencing - Rapid Novor
    Aug 1, 2021 · De novo protein sequencing is the method in which the amino acid sequence of a protein is directly determined without prior knowledge of its DNA ...
  24. [24]
    Advances in protein structure prediction and design - Nature
    Aug 15, 2019 · In this Review, we describe current approaches for protein structure prediction and design and highlight a selection of the successful applications they have ...
  25. [25]
    Accurate and efficient amino acid analysis for protein quantification ...
    May 11, 2019 · The AccQ-Tag method is typically used to establish relative amino acid composition and involves several preparations (using different hydrolysis ...
  26. [26]
    Tryptophan determination of food proteins by h.p.l.c. after alkaline ...
    Hydrolysis with either LiOH or NaOH gave similar results. Tryptophan values and the recovery of added 5-methyltryptophan were similar when hydrolysis was made.
  27. [27]
    Protein hydrolysis for amino acid analysis revisited - PubMed
    Sep 6, 2025 · We compared six different hydrolysis methods for chromatography-based amino acid analysis of plant-based food matrices, including oat, pea, and ...Missing: techniques | Show results with:techniques
  28. [28]
    Reproducible microwave-assisted acid hydrolysis of proteins using ...
    A new set-up for microwave-assisted acid hydrolysis (MAAH) with high efficiency and reproducibility to degrade proteins into peptides for mass spectrometry ...
  29. [29]
    Introducing protein deamidation: Landmark discoveries, societal ...
    Deamidation, isomerization and racemization are three prevalent protein degradation mechanisms at physiological pH and temperature, but it is important to note ...
  30. [30]
    [PDF] Stanford Moore and William H. Stein - Nobel Lecture
    Chromatographic analysis of a mixture of amino acids automatically recorded in 22 hours by the equipment shown in Fig. 4. From (26). quantitative amino acid ...
  31. [31]
    Ion Exchange Chromatography of Amino Acids. A Single Column ...
    Ion Exchange Chromatography of Amino Acids. A Single Column, High Resolving, Fully Automatic Procedure.
  32. [32]
    comparison to cation exchange with post-column ninhydrin detection
    Ion-exchange chromatography with ninhydrin detection remains the gold standard for detecting inborn errors of amino acid catabolism and transport.Missing: analyzer | Show results with:analyzer
  33. [33]
  34. [34]
    Accurate and efficient amino acid analysis for protein quantification ...
    May 11, 2019 · The studies indicate that hydrolysis is complete (86–103%) and that protein can be accurately quantified with the prescribed isotopic dilution- ...
  35. [35]
    Use of UPLC-ESI-MS/MS to quantitate free amino acid ...
    Nov 20, 2013 · We used reversed phase Ultra Performance Liquid Chromatography (UPLC) coupled to electrospray ionization tandem mass spectrometry (ESI-MS/MS) technique for FAA ...
  36. [36]
    High-Speed Quantitative UPLC-MS Analysis of Multiple Amines in ...
    Jan 19, 2017 · A targeted reversed-phase gradient UPLC-MS/MS assay has been developed for the quantification /monitoring of 66 amino acids and amino-containing compounds in ...
  37. [37]
    [PDF] Amino acid analysis refers to the methodology used to determine the ...
    Hydrolysis Solution: 6 N hydrochloric acid containing 0.1% to 1.0% of phenol, to which DMSO is added to obtain a final concentration of 2% (v/v). Vapor Phase ...
  38. [38]
    UNIT 11.10 N-Terminal Sequence Analysis of Proteins and Peptides
    However, N-terminal sequencing remains the method of choice for verifying the N-terminal boundary of recombinant proteins, determining the N-terminal of ...
  39. [39]
    Enzymatic approaches for obtaining amino acid sequence: on-target ...
    Several enzymatic approaches have proven useful for identifying the N- and C-terminal residues. They involve the use of carboxypeptidases and ...
  40. [40]
    Enzymatic Properties of Human Aminopeptidase A: REGULATION ...
    Aminopeptidases hydrolyze N-terminal amino acids of proteins and peptide substrates. They are distributed widely in animal and plant tissues as well as in ...
  41. [41]
    Mascot help: Peptide fragmentation - Matrix Science
    Fragments will only be detected if they carry at least one charge. If this charge is retained on the N terminal fragment, the ion is classed as either a, b or c ...
  42. [42]
    N-Terminal Protein Characterization by Mass Spectrometry Using ...
    A sample-preparation method for N-terminal peptide isolation from protein proteolytic digests has been developed.
  43. [43]
    [PDF] Method for Determination of the Amino Acid
    The details of the method are given and its applications to a tripeptide and a tetrapeptide are described. The applicability of the method is briefly discussed.
  44. [44]
    Key Pain Points in Amino Acid Sequencing & How to Avoid Them
    Aug 13, 2021 · Up to 50% of eukaryotic proteins have N-terminal blockages. Figure 1: Schematic Diagram of N-terminal Sequence Analysis by Edman Degradation.<|control11|><|separator|>
  45. [45]
    N and C Terminal Amino Acid Sequence Analysis - BioPharmaSpec
    N Terminal Sequencing Applications · Showing that the N-terminus of your protein is intact and as expected. · Demonstrating batch-to-batch consistency.
  46. [46]
    Carboxypeptidase A - Worthington Enzyme Manual
    Carboxypeptidase A (CPDA) is a pancreatic metalloexopeptidase that hydrolyzes the peptide bond adjacent to the C-terminal end of a polypeptide chain.
  47. [47]
    Recombinantly expressed carboxypeptidase B and purification thereof
    Because of its high specificity for C-terminal basic amino acids, carboxypeptidase B has found wide use, e.g., in end-group analysis for sequence determination.<|separator|>
  48. [48]
    C-Terminal Sequence Analysis with Carboxypeptidase Y
    To date there is no chemical method that provides the ability to determine extensive lengths of amino acid sequence sequentially at the C-terminus of a protein ...Missing: historical | Show results with:historical
  49. [49]
    Microwave enhanced Akabori reaction for peptide analysis - PubMed
    The Akabori reaction, devised in 1952 for the identification of C-terminus amino acids, involves the heating of a linear peptide in the presence of anhydrous ...
  50. [50]
    C-Terminal Ladder Sequencing via Matrix-Assisted Laser ...
    C-Terminal Ladder Sequencing via Matrix-Assisted Laser Desorption Mass Spectrometry Coupled with Carboxypeptidase Y Time-Dependent and Concentration-Dependent ...Missing: historical | Show results with:historical
  51. [51]
    B2. Sequence Determination Using Mass Spectrometry
    May 8, 2019 · Ions with the original N terminus are denoted as a, b, and c, while ions with the original C terminus are denoted as x, y, and z. c and y ions ...
  52. [52]
    C-terminal ladder sequencing of peptides using an alternative ...
    C-terminal ladder sequencing of peptides using an alternative nucleophile in carboxypeptidase Y digests ... C-terminal tag with bovine carboxypeptidase A. Journal ...
  53. [53]
    Selective Chemical Cleavage Methods in Proteomics, Including C ...
    Akabori et al. (5) used anhydrous hydrazine for protein C-terminal determination. In this reaction, internal peptide bonds are hydrazinolyzed, yielding the ...
  54. [54]
    Proteomics beyond trypsin - Tsiatsiani - 2015 - The FEBS Journal
    Mar 30, 2015 · Here, we describe some of the shortcomings of the nearly exclusive use of trypsin in proteomics and review the properties of other proteomics-appropriate ...
  55. [55]
    Trypsin Cleaves Exclusively C-terminal to Arginine and Lysine ...
    Trypsin cleaves solely C-terminal to arginine and lysine. We find that non-tryptic peptides occur only as the C-terminal peptides of proteins.
  56. [56]
    Mapping specificity, cleavage entropy, allosteric changes ... - Nature
    Mar 16, 2021 · Protease cleavage specificity was inferred by comparing the observed frequency with a random (null) distribution generated from the database and ...
  57. [57]
    Using Endoproteinases Asp-N and Glu-C to Improve Protein ...
    Glu-C cleaves at the C-terminus of glutamic and aspartic residues (4–6). Due to their specific cleavage sites, these proteinases create unique peptide fragments ...Abstract · Introduction · Asp-N, Sequencing Grade · Glu-C, Sequencing Grade
  58. [58]
    SELECTIVE CLEAVAGE OF THE METHIONYL PEPTIDE BONDS IN ...
    Cyanogen bromide cleavage of proteins in salt and buffer solutions. Analytical Biochemistry 2010, 407 (1) , 144-146. https://doi.org/10.1016/j.ab.2010.07 ...
  59. [59]
    Specificity of Endoproteinase Asp-N (Pseudomonas Fragi) - PubMed
    Endoproteinase Asp-N, a metalloprotease from a mutant strain of Pseudomonas fragi, has been reported to specifically cleave on the N-terminal side of aspartyl ...
  60. [60]
    [PDF] Peptide Sequencing by Edman Degradation
    The strategy of Sanger and colleagues for the sequencing of insulin was to characterize series of small overlapping peptides produced by cleavage of the parent ...
  61. [61]
    Peptide mapping and microsequencing of proteins separated by ...
    A method is described for the isolation of peptide fragments from proteins separated by polyacrylamide gel electrophoresis.Missing: 2D | Show results with:2D
  62. [62]
    Evaluation and optimization of reduction and alkylation methods to ...
    A typical workflow for bottom-up proteomics includes the reduction of disulfide bonds and the alkylation of sulfhydryl groups.
  63. [63]
  64. [64]
    Peptides and Proteins - MSU chemistry
    The products of the Edman degradation are a thiohydantoin heterocycle incorporating the N-terminal amino acid together with a shortened peptide chain. Amine ...<|control11|><|separator|>
  65. [65]
    CHEM 440 - Lecture 7
    Sep 19, 2016 · Instead, a "divide and conquer" strategy is used, by which smaller peptide fragments are produced, followed by sequencing of the fragments by ...<|separator|>
  66. [66]
    A Protein Sequenator
    The protein sequenator is an instrument for the automatic determination of amino acid sequences in proteins and peptides. It operates on the principle of ...
  67. [67]
    A gas-liquid solid phase peptide and protein sequenator - PubMed
    A gas-liquid solid phase peptide and protein sequenator. J Biol Chem. 1981 Aug 10;256(15):7990-7. Authors. R M Hewick, M W Hunkapiller, L E Hood, W J Dreyer.Missing: microsequencer | Show results with:microsequencer
  68. [68]
    N‐Terminal Sequence Analysis of Proteins and Peptides - 2009
    Aug 1, 2009 · Hewick, R.M., Hunkapiller, M.W., Hood, L.E., and Dryer, W.J. 1981. A gas-liquid solid phase peptide and protein sequencer. J. Biol. Chem ...
  69. [69]
    Attomole level protein sequencing by Edman degradation ... - PNAS
    Edman degradation remains the primary method for determining the sequence of proteins. In this study, accelerator mass spectrometry was used to determine ...
  70. [70]
    A Critical Review of Bottom-Up Proteomics: The Good, the Bad ... - NIH
    We aim to describe a bottom-up proteomics workflow from sample preparation to data analysis, including all of its benefits and pitfalls.
  71. [71]
    Comprehensive Overview of Bottom-Up Proteomics Using Mass ...
    Jun 4, 2024 · Bottom-up proteomic strategies rely on efficient digestion of proteins into peptides for mass spectrometry anal. In-soln. and filter-based ...
  72. [72]
    Protease Digestion for Mass Spectrometry | Protein Digest Protocols
    The use of trypsin in bottom-up proteomics may impose certain limits in the ability to grasp the full proteome. Tightly-folded proteins can resist trypsin ...
  73. [73]
    Comparison of in-gel and in-solution proteolysis in the proteome ...
    Nov 15, 2023 · The objective of this study was to assess two bottom-up proteomics workflows for the extraction of tryptic peptides from the perfusate.
  74. [74]
    Bottom-Up Proteomics: Advancements in Sample Preparation - MDPI
    In this review, we have outlined the current methods used for sample preparation in proteomics, including on-membrane digestion, bead-based digestion, ...
  75. [75]
    Multiple enzymatic digestion for enhanced sequence ... - PubMed
    Multiple enzyme digests (trypsin, Lys-C, Asp-N) increase protein sequence coverage. Trypsin and Lys-C detect distinct protein sets, increasing protein and ...
  76. [76]
    Multiple-Enzyme-Digestion Strategy Improves Accuracy and ...
    Oct 18, 2018 · Whole liver SDS lysate was processed with FASP using successive digestion with LysC and trypsin. The resulting digests were spiked with stable ...Introduction · Experimental Section · Results · Supporting Information
  77. [77]
    Enhancement of cyanogen bromide cleavage yields for methionyl ...
    Jan 1, 1999 · Cyanogen bromide (CNBr) is a common chemical used to hydrolyze peptide bonds C-terminal to methionine residues in peptides and proteins.
  78. [78]
    Systematic Evaluation of Protein Reduction and Alkylation Reveals ...
    In this study, we compared common reduction reagents (dithiothreitol, tris-(2-carboxyethyl)-phosphine, and β-mercaptoethanol) and alkylation reagents.
  79. [79]
    C18 Columns and Peptide Desalting for Mass Spectrometry
    The C18 matrix is the most ideal for the capture of hydrophobic peptides. The peptides bind to reverse-phase columns in high-aqueous mobile phase.
  80. [80]
    Lessons in de novo peptide sequencing by tandem mass spectrometry
    Oct 29, 2013 · Some of these fragments may be “odd-electron” radical ions, which are formed in ECD/ETD processes and in high-energy CID (Table 2). The ...
  81. [81]
    LESSONS IN DE NOVO PEPTIDE SEQUENCING BY TANDEM ...
    The raw data in these studies are MS/MS spectra, usually of peptides produced by proteolytic digestion of a protein. These spectra are “translated” into peptide ...
  82. [82]
    PEAKS: Powerful Software for Peptide De Novo Sequencing by MS ...
    In this communication, we describe a new de novo sequencing software package, PEAKS, to extract amino acid sequence information without the use of databases.
  83. [83]
    Novor: Real-Time Peptide de Novo Sequencing Software - PMC - NIH
    Jun 30, 2015 · This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today's peptide de novo sequencing analyses.
  84. [84]
    Uncovering Thousands of New Peptides with Sequence-Mask ...
    Here, we develop SMSNet, a deep learning-based de novo peptide sequencing framework that achieves >95% amino acid accuracy while retaining good identification ...
  85. [85]
    Ion Activation Methods for Peptides and Proteins - PMC - NIH
    In general, fragment ions that retain the N-terminus of the polypeptide are referred to as a, b and c-ions, whereas product ions that retain the C-terminus of ...
  86. [86]
    Electron Transfer Dissociation Mass Spectrometry in Proteomics - NIH
    Electron transfer dissociation (ETD) is emerging as a complementary method for characterization of peptides and post-translational modifications (PTMs).
  87. [87]
    Tandem mass spectrometry for the structural determination of ...
    The presence of other functional groups, such as an exocyclic N-terminal residue, however, can dominate the observed fragmentations. Upon collisional activation ...
  88. [88]
    Peptide Labeling with Isobaric Tags Yields Higher Identification ...
    Peptide Labeling with Isobaric Tags Yields Higher Identification Rates Using iTRAQ 4-Plex Compared to TMT 6-Plex and iTRAQ 8-Plex on LTQ Orbitrap. Click to copy ...
  89. [89]
    Peptide Labeling with Isobaric Tags Yields Higher Identification ...
    Jul 1, 2010 · In comparison to iTRAQ 4-plex the numbers of peptide-spectrum matches and unique peptides were approximately 40% lower with TMT 6-plex and more ...<|separator|>
  90. [90]
    Top-down analysis of 30-80 kDa proteins by electron transfer ...
    We show that ETD TOF MS is efficient and may provide extensive sequence information for unfolded and highly charged (around 1 charge/kDa) proteins of ~30 kDa.
  91. [91]
    Development of a top-down MS assay for specific identification ... - NIH
    Jun 19, 2024 · Here we present a fully developed top-down mass spectrometry assay for the characterization of periostin splice isoforms at the protein level.
  92. [92]
    N-terminal proteoforms may engage in different protein complexes
    Proteins originating from the same gene, yet differing at their N-terminus—so-called N-terminal proteoforms—can take part in different protein–protein ...<|control11|><|separator|>
  93. [93]
    Mass Spectrometry-Based Detection and Assignment of Protein ...
    Recent advances in mass spectrometry (MS)-based proteomics allow the identification and quantitation of thousands of posttranslational modification (PTM) sites ...
  94. [94]
    Considerations for defining +80 Da mass shifts in mass spectrometry ...
    Sep 26, 2023 · This article focusses on the MS-based analysis of those covalent modifications that induce a mass shift of +80 Da, notably phosphorylation and sulfation.
  95. [95]
    Protein Glycosylation Investigated by Mass Spectrometry: An Overview
    An overview of the most prominent techniques based on mass spectrometry (MS) for protein glycosylation (glycoproteomics) studies is here presented.
  96. [96]
    Best practices and benchmarks for intact protein analysis for top ...
    Jun 27, 2019 · The accurate mass measurement of an intact protein is the sine qua non of top-down mass spectrometry, which can characterize how proteoforms ...
  97. [97]
    Comprehensive Analysis of Protein Modifications by Top-Down ...
    Dec 1, 2011 · Top-down MS first measures the molecular weight (MW) of an intact protein and compares it with the calculated value based on the DNA-predicted ...Sample Preparation For... · Top-Down Ms Data Analysis... · Complete Ptm Mapping By...<|control11|><|separator|>
  98. [98]
    Matrix-assisted Laser Desorption/Ionization Time of Flight (MALDI ...
    Sep 9, 2013 · Here we present an accessible approach for analysing proteins larger than 100 kDa by MALDI-time of flight (TOF).
  99. [99]
    Ultra-High Mass Resolving Power, Mass Accuracy, and Dynamic ...
    Jan 19, 2020 · FT-ICR mass spectrometers provide the highest mass resolving power and mass accuracy of any mass analyzer, with up to parts-per-billion (ppb) ...
  100. [100]
    Orbitrap LC-MS | Thermo Fisher Scientific - US
    Orbitrap mass spectrometers deliver a total possible maximum resolution (FWHM) of 1,000,000 at m/z 200 and a sub-1 ppm mass accuracy in a single compact and ...
  101. [101]
    Instruments | Amster Lab - UGA
    The high mass accuracy of this instrument (<1 ppm) allows it to determine the elemental composition of molecules based on accurate mass alone. This instrument ...
  102. [102]
    Internal Fragments Generated from Different Top-Down Mass ...
    Jun 8, 2021 · Top-down mass spectrometry (TD-MS) of intact proteins results in fragment ions that can be correlated to the protein primary sequence.
  103. [103]
    Decoding protein modifications using top-down mass spectrometry
    (a) Top-down MS can distinguish between protein isoforms 1 and 2 (expressed from genes 1 and 2) with highly similar intact mass values based on differences in ...
  104. [104]
    Top-down Proteomics | Thermo Fisher Scientific - US
    Top-down proteomics detects degradation products, sequence variants, and PTMs. HRAM MS is essential for resolving intact proteins and their charge states.
  105. [105]
    Top–down Proteomics of Large Proteins up to 223 kDa Enabled by ...
    The 2225 proteoforms found in the 1D RPC-MS analysis were primarily low MW proteins ranging between 10 and 25 kDa in size (Table S1). The 2D sSEC-RPC-MS ...
  106. [106]
    Top-Down Analysis of Proteins in Low Charge States
    Feb 22, 2019 · In this study, the fragmentation behavior of the seven proteins in low charge states is evaluated. The proteins range in size from 8.5 to ...
  107. [107]
    Beyond mass spectrometry, the next step in proteomics - PMC
    Jan 10, 2020 · Sensitivity is paramount. A typical mass spectrometer detection limit is about 480 fg (20 counts/fg), which corresponds to about 10 amol or 6 ...
  108. [108]
    Mass spectrometry based proteomics: existing capabilities and ...
    Mass spectrometry (MS)-based proteomics is emerging as a broadly effective means for identification, characterization, and quantification of proteins.
  109. [109]
    Mass Spectrometry Accelerates Membrane Protein Analysis - PMC
    In this review, we focus on the eminence of shotgun MS for accelerating the identification and study of membrane proteins. Specifically, we briefly cover recent ...
  110. [110]
    A Handle on Mass Coincidence Errors in De Novo Sequencing of ...
    Aug 2, 2024 · Sequencing accuracy at the peptide level is limited by the isobaric residues leucine and isoleucine ...
  111. [111]
    Common errors in mass spectrometry-based analysis of post ... - NIH
    Here, we review the most common errors in MS-based PTM analyses with the goal of adopting strategies that maximize correct interpretation.
  112. [112]
    PEAKS DB: De Novo Sequencing Assisted Database Search for ...
    The aim of PEAKS DB is to identify peptides from a sequence database with MS/MS data. As such, PEAKS DB belongs to the database search category of peptide ...
  113. [113]
    High-Throughput Sequencing of Peptoids and Peptide−Peptoid ...
    One of our laboratories has previously demonstrated that resin-bound peptides can be rapidly sequenced by partial Edman degradation/mass spectrometry (PED/MS).
  114. [114]
    Mechanism of alternative splicing and its regulation - PMC
    Alternative splicing of precursor mRNA is an essential mechanism to increase the complexity of gene expression, and it plays an important role in cellular ...
  115. [115]
    ORFfinder Home - NCBI - NIH
    ORF finder searches for open reading frames (ORFs) in the DNA sequence you enter. The program returns the range of each ORF, along with its protein translation.
  116. [116]
    Thousands of novel translated open reading frames in humans ... - NIH
    Abstract. Accurate annotation of protein coding regions is essential for understanding how genetic information is translated into function.
  117. [117]
    RNA splicing — a central layer of gene regulation - Nature
    May 21, 2025 · Alternative splicing greatly expands the coding potential of the genome; more than 95% of human multi-intron genes undergo alternative splicing ...
  118. [118]
    Codon—anticodon pairing: The wobble hypothesis - ScienceDirect
    This hypothesis is explored systematically, and it is shown that such a wobble could explain the general nature of the degeneracy of the genetic code.
  119. [119]
    EMBOSS TRANSEQ < Job Dispatcher < EMBL-EBI
    EMBOSS Transeq translates nucleic acid sequences to their corresponding peptide sequences. It can translate to the three forward and three reverse frames.
  120. [120]
    Mixing genome annotation methods in a comparative analysis ...
    For the six-frame translated searches, we first generated a six-frame translation of the genome assembly of each species using the 'esl-translate' command ...
  121. [121]
    The Expanding Landscape of Alternative Splicing Variation in ...
    Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. ... protein sequence of SP140, minigene splicing reporter assays ...
  122. [122]
    Alternative Splicing, RNA Editing, and the Current Limits of Next ...
    Splicing of pre-mRNA can result in the expression of the full encoded protein or (n) number of protein isoforms produced as a result of alternative splicing.
  123. [123]
    Mapping and quantifying mammalian transcriptomes by RNA-Seq
    May 30, 2008 · We have mapped and quantified mouse transcriptomes by deeply sequencing them and recording how frequently each gene is represented in the sequence sample (RNA- ...
  124. [124]
    A survey of best practices for RNA-seq data analysis | Genome Biology
    Jan 26, 2016 · We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and ...
  125. [125]
    Translate tool - Expasy
    Translate is a tool which allows the translation of a nucleotide (DNA/RNA) sequence to a protein sequence. DNA or RNA sequence. Output format.
  126. [126]
    A high-throughput SNP discovery strategy for RNA-seq data
    Feb 27, 2019 · SNPs in the coding region can be divided into two types, synonymous and nonsynonymous SNPs, with protein sequence affected by the latter type.
  127. [127]
    Diagnosis of fusion genes using targeted RNA sequencing - Nature
    Mar 27, 2019 · We establish that fusion gene detection with targeted RNAseq is both sensitive and quantitative by optimising laboratory and bioinformatic variables.
  128. [128]
    Genome-Wide Analysis in Vivo of Translation with Nucleotide ...
    Apr 10, 2009 · We present a ribosome-profiling strategy that is based on the deep sequencing of ribosome-protected mRNA fragments and enables genome-wide investigation of ...
  129. [129]
    RNA-Seq: a revolutionary tool for transcriptomics - PMC - NIH
    RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods.
  130. [130]
    [PDF] Single-molecule protein sequencing with nanopores
    Finally, we outline the advantages and limitations of nanopore systems for protein sequencing and the challenges that remain to be overcome for realizing de ...
  131. [131]
    Oxford Nanopore's roadmap to proteomics
    May 20, 2025 · Oxford Nanopore unveils its pioneering roadmap to full protein sequencing, advancing real-time, direct proteomics for transformative ...
  132. [132]
    Protein identification by nanopore peptide profiling - Nature
    Oct 4, 2021 · We show that an engineered Fragaceatoxin C nanopore is capable of identifying individual proteins by measuring peptide spectra that are produced from ...
  133. [133]
    Proteomics | Oxford Nanopore Technologies
    Our goal is to directly read and identify native proteins, just like we do with DNA and RNA, to empower research and improve health.
  134. [134]
    Nanopore protein sequencing achieves significant new milestones
    One approach to nanopore sequencing of proteins utilizes peptides as intact polymers that can be fed into a nanopore using similar constructs as those developed ...
  135. [135]
    Multi-pass, single-molecule nanopore reading of long protein strands
    Sep 11, 2024 · When we considered top-N accuracy measurements, our model attained 67% accuracy for top-5 and 81% for top-8 accuracy in the 20-way ...
  136. [136]
  137. [137]
    The emerging landscape of single-molecule protein sequencing ...
    Jun 7, 2021 · To overcome this problem, recognition tunneling has been developed in which the electrodes are covalently modified with adaptor molecules ...
  138. [138]
    Single Molecule Spectroscopy of Amino Acids and Peptides by ... - NIH
    Apr 6, 2014 · We are currently developing recognition tunneling (RT) as an electronic single molecule sequencing method for DNA. Here, we show that the method ...
  139. [139]
  140. [140]
  141. [141]
  142. [142]
    software for protein identification from sequence tags with de novo ...
    For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence ...
  143. [143]
    [2208.05598] PASS: De novo assembler for short peptide sequences
    Aug 11, 2022 · Here we present PASS, a de novo assembler for short peptide sequences that can be used to reconstruct large portions of protein targets.
  144. [144]
    Highly Robust de Novo Full-Length Protein Sequencing
    Feb 16, 2022 · De novo mass spectrometry (MS)-based assembly is an efficient way for full-length protein sequencing. A target protein is digested into peptides ...
  145. [145]
    Estimating error rates for single molecule protein sequencing ...
    In 2018 the Edman failure rate was measured at around 6%. Edman failure rates now appear to be around 1% or 2% for most residues and for much shorter TFA ...
  146. [146]
    Antibody sequences assembly method based on weighted de Bruijn ...
    Jan 31, 2023 · To address this problem, we propose a new assembly method, DBAS, which integrates the quality scores and sequence alignment scores from de novo ...
  147. [147]
    UniProt
    UniProt is the world's leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information.
  148. [148]
    Where do the UniProtKB protein sequences come from? | UniProt help
    Aug 21, 2025 · Most UniProtKB sequences come from EMBL-Bank/GenBank/DDBJ translations, with some from PDB, direct sequencing, literature scans, and gene ...
  149. [149]
    RCSB PDB: Homepage
    RCSB Protein Data Bank (RCSB PDB) enables breakthroughs in science and education by providing access and tools for exploration, visualization, and analysis.
  150. [150]
  151. [151]
    RefSeq: NCBI Reference Sequence Database - NIH
    A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.
  152. [152]
    NCBI Reference Sequence (RefSeq): a curated non-redundant ...
    RefSeq is a public database of nucleotide and protein sequences with corresponding feature and bibliographic annotation.
  153. [153]
    Mascot help: Overview of sequence database searching
    Mascot search overview. Mascot is a powerful search engine which uses mass spectrometry data to identify proteins from primary sequence databases.
  154. [154]
    Faster SEQUEST Searching for Peptide Identification from Tandem ...
    The goal of peptide identification by database search is to label each experimentally observed spectrum from an MS/MS run with the peptide most likely to have ...
  155. [155]
    Highly accurate protein structure prediction with AlphaFold - Nature
    Jul 15, 2021 · The AlphaFold network directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and ...
  156. [156]
    SignalP 6.0 - DTU Health Tech - Bioinformatic Services
    The SignalP 6.0 server predicts the presence of signal peptides and the location of their cleavage sites in proteins from Archaea, Gram-positive Bacteria, Gram ...
  157. [157]
    SignalP 6.0 predicts all five types of signal peptides using protein ...
    We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
  158. [158]
    MaxQuant
    MaxQuant is a quantitative proteomics software package designed for analyzing large mass-spectrometric data sets. It is specifically aimed at ...
  159. [159]
    Andromeda: A Peptide Search Engine Integrated into the MaxQuant ...
    The computational proteomics pipeline starting from raw data files to reported protein groups and their quantitative ratios now appears unified to the user.
  160. [160]
    Quantum-Si and Researchers to Showcase Next-Generation Protein ...
    Feb 13, 2025 · Quantum-Si and Researchers to Showcase Next-Generation Protein Sequencing™ at US HUPO 2025 · Protein Sequencing with Single Amino Acid Resolution ...
  161. [161]
    dbPTM 2025 update: comprehensive integration of PTMs and ...
    Nov 11, 2024 · The dbPTM 2025 update significantly expands the database to include over 2.79 million PTM sites, of which 2.243 million are experimentally ...