This is an university project develop for the course Big Data. The purpose of this project is to perform an analysis of DNA subsequences samples, named barcode, with the aim to classify and predict the species bolonging from a specific genomic sequence. In particular we conducted an analysis based on the concept of multi-locus, comparating the results with the past mono-locus research.

The work was based on the data presents in the BOLD database. Specifically we considered species related to the family of plants and fungi.

The tool was realized by the combination of several components, develop by Bash Scripting and Python languages, along with a map-reduce stage. To use it, put the data in a foldier named “data” and start “fastaSplitter.sh” from a bash terminal.

More detailed information about the project can be found on this paper (located in italian)


  • machine learning ,
  • bold ,
  • big data ,
  • map-reduce ,
  • python ,
  • bash scripting