AAindex2

The AAindex2 section currently contains 94 amino acid mutation matrices: 67 symmetric matrices and 27 non-symmetric matrices. The entry format is almost the same as that of AAindex1, except that an entry contains 210 numerical values (20 diagonal and 20 × 19/2 = 190 off-diagonal elements) for a symmetric matrix and 400 or more numerical values for a non-symmetric matrix (some matrices include a gap or distinguish two states of cysteine).
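The value counts quoted above follow directly from the matrix dimensions. A quick arithmetic check, written as a sketch with illustrative variable names:

```python
# Number of values stored for a 20-letter amino acid alphabet.
n = 20

diagonal = n                      # AA, RR, ..., VV
off_diagonal = n * (n - 1) // 2   # one value per unordered pair
symmetric_values = diagonal + off_diagonal

full_values = n * n               # every ordered pair stored separately

print(symmetric_values)  # 210
print(full_values)       # 400
```

Matrices that add a gap symbol or a second cysteine state have more than 20 rows and columns, which is why a non-symmetric entry can hold more than 400 values.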

Record format

************************************************************************
*                                                                      *
* Each entry has the following format.                                 *
*                                                                      *
* H Accession number                                                   *
* D Data description                                                   *
* R PMID                                                               *
* A Author(s)                                                          *
* T Title of the article                                               *
* J Journal reference                                                  *
* * Comment or missing                                                 *
* M rows = ARNDCQEGHILKMFPSTWYV, cols = ARNDCQEGHILKMFPSTWYV           *
*   AA                                                                 *
*   AR RR                                                              *
*   AN RN NN                                                           *
*   AD RD ND DD                                                        *
*   AC RC NC DC CC                                                     *
*   AQ RQ NQ DQ CQ QQ                                                  *
*   AE RE NE DE CE QE EE                                               *
*   AG RG NG DG CG QG EG GG                                            *
*   AH RH NH DH CH QH EH GH HH                                         *
*   AI RI NI DI CI QI EI GI HI II                                      *
*   AL RL NL DL CL QL EL GL HL IL LL                                   *
*   AK RK NK DK CK QK EK GK HK IK LK KK                                *
*   AM RM NM DM CM QM EM GM HM IM LM KM MM                             *
*   AF RF NF DF CF QF EF GF HF IF LF KF MF FF                          *
*   AP RP NP DP CP QP EP GP HP IP LP KP MP FP PP                       *
*   AS RS NS DS CS QS ES GS HS IS LS KS MS FS PS SS                    *
*   AT RT NT DT CT QT ET GT HT IT LT KT MT FT PT ST TT                 *
*   AW RW NW DW CW QW EW GW HW IW LW KW MW FW PW SW TW WW              *
*   AY RY NY DY CY QY EY GY HY IY LY KY MY FY PY SY TY WY YY           *
*   AV RV NV DV CV QV EV GV HV IV LV KV MV FV PV SV TV WV YV VV        *
* //                                                                   *
************************************************************************
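The lower-triangular layout above can be expanded into a full symmetric matrix by mirroring each value across the diagonal. The following is a minimal sketch of that idea, not the library's own parser; `expand_lower_triangle` and `ORDER` are hypothetical names introduced here, and the rows are assumed to be already split into floats.

```python
from typing import Dict, List

ORDER = "ARNDCQEGHILKMFPSTWYV"  # row/column order taken from the M line


def expand_lower_triangle(
    rows: List[List[float]], order: str = ORDER
) -> Dict[str, Dict[str, float]]:
    """Mirror lower-triangular rows into a full symmetric nested dict."""
    if len(rows) != len(order):
        raise ValueError("expected one row per amino acid")
    matrix: Dict[str, Dict[str, float]] = {aa: {} for aa in order}
    for i, row in enumerate(rows):
        # Row i holds the values for columns 0..i (diagonal included).
        if len(row) != i + 1:
            raise ValueError(f"row {i} should hold {i + 1} values")
        for j, value in enumerate(row):
            a, b = order[i], order[j]
            matrix[a][b] = value
            matrix[b][a] = value  # mirror across the diagonal
    return matrix


# Toy example with a 3-letter alphabet instead of the full 20.
toy = expand_lower_triangle([[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]], order="ARN")
```

In the toy example the third row supplies the AN, RN and NN values, so `toy["A"]["N"]` and `toy["N"]["A"]` both hold 4.0.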

class aaindex.aaindex2.AAIndex2

Bases: _AAIndexMatrix

Python parser for AAindex2: Amino Acid Substitution Matrix Database.

Inherits all parsing, search, and lookup functionality from _AAIndexMatrix. Stores the 94 known 20x20 substitution matrices from AAindex2 (http://www.genome.jp/aaindex/).

References

[1] Kawashima, S. and Kanehisa, M.; AAindex: amino acid index database. Nucleic Acids Res. 28, 374 (2000).

__contains__(record_code: object) → bool

Return True if record_code exists in the database.

__getitem__(record_code: str) → Map

Return a record by accession number wrapped in a Map (dot-notation dict).

Parameters:

record_code – AAindex accession number (case-insensitive, leading/trailing whitespace is stripped).

Returns:

Record data as a Map, accessible via dict or dot notation.

Raises:
  • TypeError – If record_code is not a string.

  • ValueError – If record_code is not found in the database.
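The lookup rules described above (type check, whitespace stripping, case-insensitive match, ValueError on a miss) can be sketched as follows. This is an illustration only: `normalize_code`, `lookup` and the `DEMO000001` record are hypothetical, and the sketch assumes accession numbers are stored in upper case.

```python
def normalize_code(record_code):
    """Strip whitespace and upper-case an accession number (hypothetical helper)."""
    if not isinstance(record_code, str):
        raise TypeError(
            f"record_code must be a str, got {type(record_code).__name__}"
        )
    return record_code.strip().upper()


def lookup(records, record_code):
    """Return the record for record_code or raise ValueError if absent."""
    code = normalize_code(record_code)
    if code not in records:
        raise ValueError(f"record {code!r} not found in the database")
    return records[code]


# Placeholder database with a made-up accession number.
records = {"DEMO000001": {"description": "toy record"}}
found = lookup(records, "  demo000001 ")
```

Normalizing once, before the membership test, keeps the error message and the dictionary key consistent.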

__iter__() → Iterator[str]

Iterate over all accession numbers in the database.

__len__() → int

Return total number of records in the database.

__repr__() → str

Return a canonical string representation of this instance.

__sizeof__() → int

Return the on-disk size of the raw AAindex data file in bytes.

property aaindex_filename: str

amino_acids() → List[str]

Return sorted list of the 20 canonical amino acid single-letter codes.

Derived from the row_order field of the first record in the database.

Returns:

Sorted list of single-letter amino acid codes.

property data_dir: str

get(record_code: str, aa1: str, aa2: str) → float | None

Return the pairwise matrix score for two amino acids from a given record.

For a symmetric matrix, get(code, aa1, aa2) == get(code, aa2, aa1). Returns None when the pair carries an NA value in the source data or when either amino acid letter is not present in this record’s matrix.

Parameters:
  • record_code – AAindex accession number.

  • aa1 – Single-letter code for the first amino acid.

  • aa2 – Single-letter code for the second amino acid.

Returns:

Pairwise score as float, or None if data is not available.

Raises:
  • TypeError – If aa1 or aa2 is not a string.

  • ValueError – If record_code is not found in the database.
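The None-on-missing behaviour described above can be illustrated with a plain nested dict. This is a sketch under stated assumptions, not the library's implementation: `pair_score` and `toy_matrix` are hypothetical, NA values are assumed to be stored as None, and no case folding is applied to the amino acid letters.

```python
from typing import Dict, Optional


def pair_score(
    matrix: Dict[str, Dict[str, Optional[float]]], aa1: str, aa2: str
) -> Optional[float]:
    """Look up a pairwise score, returning None when it is unavailable."""
    if not isinstance(aa1, str) or not isinstance(aa2, str):
        raise TypeError("aa1 and aa2 must be single-letter strings")
    row = matrix.get(aa1)
    if row is None:          # first letter not present in this matrix
        return None
    return row.get(aa2)      # None for an absent letter or a stored NA


# Toy 3-letter matrix; the N/R pair is NA in this made-up data.
toy_matrix = {
    "A": {"A": 4.0, "R": -1.0, "N": -2.0},
    "R": {"A": -1.0, "R": 5.0, "N": None},
    "N": {"A": -2.0, "R": None, "N": 6.0},
}

score = pair_score(toy_matrix, "A", "R")
```

Callers should test the result with `is None` rather than truthiness, since 0.0 is a valid score.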

property last_updated: str

num_records() → int

Return the total number of records in the database.

Returns:

Number of records as int.

parse_aaindex() → Dict

Parse the raw AAindex database file into a nested dict and cache as JSON.

Each record is keyed by its accession number and stores metadata alongside the full symmetric 20x20 matrix reconstructed from the lower-triangular source data. The result is written to a .json file in the data directory for fast subsequent loads.

Returns:

Parsed database keyed by accession number.

Return type:

dict

Raises:
  • IOError – If the raw database file cannot be opened.

  • ValueError – If a duplicate accession number is encountered.
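The parse-then-cache pattern described above can be sketched as below. It is not the library's code: `load_or_parse` and its parameters are hypothetical, and the sketch assumes the JSON cache is reused whenever it already exists.

```python
import json
import os
from typing import Callable, Dict


def load_or_parse(
    raw_path: str, cache_path: str, parse: Callable[[str], Dict]
) -> Dict:
    """Return the cached JSON if present, otherwise parse and write the cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    data = parse(raw_path)  # slow path: parse the raw database file
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data
```

Note that JSON stores every key as a string and maps None to null, so a nested dict of scores keyed by single-letter codes survives the round trip unchanged.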

record_codes() → List[str]

Return sorted list of all accession numbers in the database.

Returns:

Sorted list of accession number strings.

record_names() → List[str]

Return a list of description strings for all records.

Returns:

List of description strings in database insertion order.

search(description: str | List[str]) → Dict

Search records by keyword(s) present in their description field.

Parameters:

description – Keyword string or list of keyword strings. Matching is case-insensitive.

Returns:

Dict of matching records keyed by accession number. Returns an empty dict if no records match.

Raises:

TypeError – If description is not a str or list.
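A case-insensitive keyword search over description fields might look like the sketch below. The names are hypothetical and the records are made up; in particular, treating a record as a match when any keyword occurs as a substring is an assumption, since the text above does not say how multiple keywords are combined.

```python
from typing import Dict, List, Union


def search_descriptions(
    records: Dict[str, Dict], description: Union[str, List[str]]
) -> Dict[str, Dict]:
    """Return records whose description contains any keyword (case-insensitive)."""
    if isinstance(description, str):
        keywords = [description]
    elif isinstance(description, list):
        keywords = description
    else:
        raise TypeError("description must be a str or a list of str")
    lowered = [keyword.lower() for keyword in keywords]
    return {
        code: record
        for code, record in records.items()
        if any(k in record["description"].lower() for k in lowered)
    }


# Made-up records for illustration only.
demo_records = {
    "DEMO000001": {"description": "Toy substitution matrix"},
    "DEMO000002": {"description": "Toy contact potential"},
}
hits = search_descriptions(demo_records, "SUBSTITUTION")
```

Here `hits` contains only `DEMO000001`, and a keyword that matches nothing yields an empty dict.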

values(record_code: str) → Dict

Return the full 20x20 matrix dict for a given record.

Shortcut that avoids loading the whole record when only the matrix is needed. Consistent with AAIndex1.values(), which returns amino acid values.

Parameters:

record_code – AAindex accession number.

Returns:

Nested dict of pairwise scores keyed by single-letter amino acid codes.

Raises:

ValueError – If record_code is not found in the database.
