Data Management and Exploratory Data Analysis

The scientific method – Question, research, hypothesis, experiment, analysis and conclusion

The CRISP-DM method (Cross-Industry Standard Process for Data Mining) – Business understanding, data understanding, data preparation, modelling, evaluation and deployment

Big data – Volume, velocity, variety, veracity

Reasons to use R:

  • R is open source and free
  • It is the language of statisticians
  • You can combine R with LaTeX

Text editors: RStudio, Notepad++ or Emacs

Open science – Any initiative that aims at lowering or erasing the technical, social, and cultural barriers that prevent scientists from sharing knowledge with one another and with individuals outside of the academic community, but also the barriers that prevent anyone from producing knowledge.

Open science needs visibility, scrutiny, the ability to reuse, and public access

A full publication needs both the code and the data

Version control – The most common tool for version control is Git, with projects typically hosted on GitHub. The best practices are: commit little and often, use branches for new features, and use protected branches on large projects.

Process:

  1. Measured data
  2. Analytic data (tidied version of the prior)
  3. Computational results
  4. Figures, tables, numerical summaries
  5. Article
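
The first four stages of the process above can be sketched in a few lines of code. This is only a minimal illustration, written in Python with the standard library; the same stages apply equally in R. The readings, the column names (`subject`, `visit1`, `visit2`) and the helper functions `tidy`, `summarise` and `as_table` are all hypothetical and chosen purely for the example.

```python
from statistics import mean

# 1. Measured data: hypothetical raw readings, one row per subject and
#    one column per visit (a "wide" layout, with one missing value).
measured = [
    {"subject": "A", "visit1": "5.1", "visit2": "5.9"},
    {"subject": "B", "visit1": "4.8", "visit2": ""},
    {"subject": "C", "visit1": "6.0", "visit2": "6.4"},
]


def tidy(rows):
    """2. Analytic data: one observation per row, numeric values, blanks dropped."""
    out = []
    for row in rows:
        for visit in ("visit1", "visit2"):
            if row[visit] != "":
                out.append(
                    {"subject": row["subject"], "visit": visit, "value": float(row[visit])}
                )
    return out


def summarise(tidy_rows):
    """3. Computational results: mean value per visit."""
    by_visit = {}
    for r in tidy_rows:
        by_visit.setdefault(r["visit"], []).append(r["value"])
    return {visit: mean(values) for visit, values in by_visit.items()}


def as_table(results):
    """4. Numerical summary laid out as a plain-text table for the article."""
    lines = ["visit   mean"]
    for visit in sorted(results):
        lines.append(f"{visit}  {results[visit]:.2f}")
    return "\n".join(lines)


analytic = tidy(measured)
results = summarise(analytic)
table = as_table(results)
print(table)
```

Keeping each stage as its own function means the measured data is never modified in place, so anyone with the code and the data can rerun the whole chain from the start.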