[cpp] FASTA 파일을 인덱스를 사용하여 읽는 방법

들어가며

안녕하세요 한주현입니다

오늘은 cpp로 fasta 파일을 읽는데 index를 사용하여 전체 파일을 읽지 않고,

미리 인덱스 파일을 읽어 해당 위치로 이동하여 읽을 수 있는 코드에 대해 말해보겠습니다.

FASTA 파일 설명

먼저 FASTA 파일과 FASTA 파일의 index 파일에 대해서 설명해보겠습니다.

FASTA 파일은 아래와 같이 생겼습니다.

> 로 시작하는 header 줄이 있고

그 다음은 염기서열이 있는 줄 들이 있습니다.

각각의 >header 와 서열들을 하나의 record라고 부릅니다.

FASTA 파일에는 하나의 record가 들어있을 수도 있고 여러개의 records가 있을 수도 있습니다.

예시에서는 세 개의 records가 있고 각각의 record는 한 줄의 염기 개수가 60자로 되어있습니다.

$ cat sample.fasta
>sample1
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>sample2
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>sample3
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

samtools를 사용하여 FASTA 파일의 indexing을 해보겠습니다.

$ samtools faidx sample.fasta

FASTA 파일이 indexing이 되었습니다.

이 파일을 살펴보면 아래 와 같이 생겼습니다.

$ cat sample.fasta.fai
sample1 80      9       60      61
sample2 84      100     60      61
sample3 87      195     60      61

첫 번째 컬럼은 header 줄에 있는 내용.

두 번째 컬럼은 각 record에 해당하는 염기서열의 숫자.

세 번째 컬럼은 각 record의 염기서열이 시작하는 offset.

네 번째 컬럼은 한 줄에 포함된 염기서열의 개수.

마지막 컬럼은 한 줄에 포함된 염기서열의 개수와 가장 끝의 new line의 개수까지 포함한 값.

입니다.

파일을 전체를 읽지 않고 원하는 부분만 읽을 수 있는 전략

예를 들어 >sample1 의 offset, 즉 record에서 염기서열이 시작하는 지점은 9 입니다.

그 이유는 >sample1\n 의 각각 character 개수를 세어본다면 9 글자 이기 때문입니다.

그래서 9번째 부터 읽게 되면 sample1의 record의 염기를 읽을 수 있습니다.

cpp 코드

	#include <fstream>
	#include <iostream>
	#include <sstream>

	inline int ROW = 5;
	/**
	* @brief split a string by a delimiter
	*
	* @param s
	* @param delim
	* @param tokens
	* @return true if the string is split successfully
	* @return false if the string is not split successfully
	*/
	bool split(const std::string &s, char delim, std::vector<std::string> &tokens) {
	std::string token;
	std::istringstream iss(s);
	while (std::getline(iss, token, delim)) {
	tokens.push_back(token);
	}
	return true;
	}

	/**
	* @brief read the index file
	*
	* @param filename
	* @param positions
	* @return true if the index file is read successfully
	* @return false if the index file is not read successfully
	*/
	bool read_index(const std::string &filename,
	std::vector<std::pair<std::string, int>> &positions) {
	std::ifstream ifs(filename);
	if (!ifs.is_open()) {
	std::cerr << "Error: cannot open file " << filename << std::endl;
	return false;
	}
	std::string line;
	while (std::getline(ifs, line)) {
	std::vector<std::string> tokens;
	split(line, '\t', tokens);
	std::pair<std::string, int> p = {tokens[0], std::stoi(tokens[2])};
	positions.push_back(p);
	}
	return true;
	}

	/**
	* @brief read the fasta file with an offset
	*
	* @param fasta_filepath
	* @param seq
	* @param offset
	* @return true if the fasta file is read successfully
	* @return false if the fasta file is not read successfully
	*/
	bool read_fasta(const std::string &fasta_filepath, std::string &seq,
	int &offset) {
	std::ifstream ifs(fasta_filepath);
	if (!ifs.is_open()) {
	std::cerr << "Error: cannot open file " << fasta_filepath << std::endl;
	return false;
	}
	char *buffer = new char[ROW];
	ifs.seekg(offset, std::ios::beg);
	ifs.read(buffer, ROW);
	ifs.close();
	seq = buffer;
	delete[] buffer;
	return true;
	}

	int main(int argc, char *argv[]) {
	if (argc != 3) {
	std::cerr << "#usage: " << argv[0] << " <reference.fasta> <chromosome>"
	<< std::endl;
	return -1;
	}
	std::string fasta_filepath = argv[1];
	std::string chromosome = argv[2];
	std::string fasta_index_filepath = fasta_filepath + ".fai";
	std::vector<std::pair<std::string, int>> positions;
	if (!read_index(fasta_index_filepath, positions)) {
	return -1;
	}

	for (auto &p : positions) {
	if (p.first == chromosome) {
	std::string seq;
	int offset = p.second;
	read_fasta(fasta_filepath, seq, offset);
	std::cout << seq << std::endl;
	break;
	}
	}

	return 0;
	}

view raw read_fasta_with_index.cpp hosted with ❤ by GitHub

이번 글에서 핵심적이라고 말할 수 있는 부분은

read_fasta 함수에서 ifs.seekg(offset, std::ios::beg) 부분입니다.

offset만큼을 이동하여 읽는 부분으로

이렇게 하게 되면 일반적으로 human reference genome fasta 파일이 3GB 인데 이를 전부 읽지 않고도 특정 위치만 읽어낼 수 있습니다.

즉 메모리 효율적으로 읽을 수 있다는 점이죠.

오늘은 offset을 사용하여 FASTA 파일을 읽는 방법에 대해 알아보았습니다.

여러분들께 도움이 되셨으면 합니다.

그럼 다음 포스팅에서 또 만나요.

저작자표시 비영리 동일조건

'컴퓨터 > C & C++' 카테고리의 다른 글

[C++] google test 실제 코드에서 컴파일 및 수행 방법 (2)	2023.09.03
[C++] google test 다양한 ASSERT, EXPECT 테스트 방법 (0)	2023.09.03
[C++] google test 수행 방법, ASSERT와 EXPECT (0)	2023.09.03
[C++] google test 설치 방법 (0)	2023.09.03
[C++] 큰 수 더하기 (adding big numbers) (1)	2023.08.05

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

생물정보학자의 블로그

[cpp] FASTA 파일을 인덱스를 사용하여 읽는 방법

들어가며

FASTA 파일 설명

파일을 전체를 읽지 않고 원하는 부분만 읽을 수 있는 전략

'컴퓨터 > C & C++' 카테고리의 다른 글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

[cpp] FASTA 파일을 인덱스를 사용하여 읽는 방법

들어가며

FASTA 파일 설명

파일을 전체를 읽지 않고 원하는 부분만 읽을 수 있는 전략

'컴퓨터 > C & C++' 카테고리의 다른 글

관련글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역