摘要

本论文主要研究内容是将传统的全基因组测序与Hadoop框架结合的大数据测序平台研发,通过Hadoop中的HDFS分布式存储系统来提供高可靠的存储服务,结合基因测序的一系列软件工具(如:BWA、Samtools、Picard和GATK等)来进行测序流程设计,并引入第三方FreeMarker模板引擎来制定模板脚本,针对不同的样本数据生成定制化的脚本处理文件。将各个样本的处理脚本与Hadoop框架的MapReduce计算框架结合,以Map任务的方式提交到Hadoop集群的各个计算节点的Container容器中运行,从而实现基因测序的并行化测序处理流程,得到样本组中各个样本对的单体变异检测gVCF文件。再通过一个Reduce任务,将各个Map阶段的变异检测结果中包含基因变异位点的信息汇总到一个VCF文件,得到最终多个样本准确的变异集合。

通过本次毕业设计,让我学习了全基因组基因测序的全过程和Hadoop大数据框架的相关组件,并掌握许多关于脚本处理的过程及方法。

关键词 Hadoop;WGS;HDFS;MapReduce;全基因组测序;Hadoop分布式环境;变异检测


Title: Research on a Hadoop-Based Big Data Analysis Platform for Genome Sequencing

Abstract:
This thesis develops a big data sequencing platform that combines conventional whole genome sequencing (WGS) with the Hadoop framework. HDFS, Hadoop's distributed file system, provides highly reliable storage. The sequencing workflow is built from a set of established tools (e.g., BWA, Samtools, Picard, and GATK), and the third-party FreeMarker template engine is used to define template scripts from which a customized processing script is generated for each sample's data. Each sample's script is then submitted as a Map task under Hadoop's MapReduce framework and runs in a Container on one of the cluster's compute nodes, so that the samples are processed in parallel and a single-sample variant-calling gVCF file is produced for each sample pair in the sample group. Finally, a single Reduce task aggregates the variant sites contained in the Map-stage results into one VCF file, yielding the final, accurate variant set across all samples.
Through this graduation project, I learned the complete whole genome sequencing workflow and the relevant components of the Hadoop big data framework, and became familiar with many script-processing procedures and methods.

Keywords: Hadoop; WGS; HDFS; MapReduce; whole genome sequencing; Hadoop distributed environment; variant detection
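
The workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the platform's implementation: Python's `string.Template` stands in for FreeMarker, plain functions stand in for Hadoop Map and Reduce tasks, and the reduce step is a toy merge of variant sites rather than a real gVCF-to-VCF conversion. The function names (`map_task`, `reduce_task`), the file names, and the exact tool commands in the template are assumptions made for illustration; the thesis does not specify them here.

```python
from string import Template

# Hypothetical per-sample script template. The thesis uses FreeMarker;
# string.Template is a stand-in. The commands, file names, and the assumption
# that the reference is already indexed are illustrative only.
SCRIPT_TEMPLATE = Template(r"""#!/bin/bash
set -euo pipefail
bwa mem -R '@RG\tID:${sample}\tSM:${sample}' ${ref} ${fq1} ${fq2} | samtools sort -o ${sample}.sorted.bam -
gatk MarkDuplicates -I ${sample}.sorted.bam -O ${sample}.dedup.bam -M ${sample}.metrics.txt
samtools index ${sample}.dedup.bam
gatk HaplotypeCaller -R ${ref} -I ${sample}.dedup.bam -O ${sample}.g.vcf.gz -ERC GVCF
""")


def map_task(sample):
    """Render the customized script for one sample (one Map task per sample).

    In the platform described above the rendered script would be executed in
    a Container on a cluster node; here it is only returned as text.
    """
    return sample["sample"], SCRIPT_TEMPLATE.substitute(sample)


def reduce_task(per_sample_sites):
    """Toy Reduce step: merge per-sample variant sites into one sorted list.

    Each site is a (chrom, pos, ref, alt) tuple. The result pairs every
    distinct site with the sorted names of the samples that carry it.
    """
    merged = {}
    for name, sites in per_sample_sites.items():
        for site in sites:
            merged.setdefault(site, set()).add(name)
    return [(site, sorted(names)) for site, names in sorted(merged.items())]


if __name__ == "__main__":
    samples = [
        {"sample": "S1", "ref": "ref.fa", "fq1": "S1_1.fq", "fq2": "S1_2.fq"},
        {"sample": "S2", "ref": "ref.fa", "fq1": "S2_1.fq", "fq2": "S2_2.fq"},
    ]
    for name, script in map(map_task, samples):
        print(f"--- script for {name} ---")
        print(script)
```

Because each rendered script depends only on its own sample's inputs, the Map tasks are independent of one another and can run in parallel; only the final merge needs to see every sample's results, which is why a single Reduce task suffices.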