Dataset of academic performance evolution for engineering students.

This data article presents the results of engineering students in national assessments for secondary and university education. The data contain academic, social, and economic information for 12,411 students. The data were obtained by systematically cross-referencing the databases of the Colombian Institute for the Evaluation of Education (ICFES). The structure of the data makes it possible to observe the influence of social variables and the evolution of students' learning skills, and it can also serve as input for analyses of academic efficiency, student recommendation systems, and educational data mining. The data are presented in comma-separated value format and can be accessed through the Mendeley Data Repository (https://data.mendeley.com/datasets/83tcx8psxv/1).


Specifications table
Subject: Social Sciences
Specific subject area: Education
Type of data: Raw, analyzed, and descriptive statistical data
Parameters for data collection: The data were collected according to criteria that the researchers identified as useful for analyzing the academic performance of Engineering students in two periods: first, the evaluation taken at the end of high school, and second, the evaluation taken at the end of their professional training. Database crossing criteria were used to associate the information from the secondary education stage with the professional training in Engineering.
Description of data collection: The observations correspond to the evaluation results at two moments of education for Engineering students in Colombia. The first moment corresponds to the results of the secondary evaluation and the second moment to the results of the professional evaluation. Variables describing the social context in which the students live are also included.
Data source location: Bogotá, Colombia
Data accessibility: The data are available at https://data.mendeley.com/datasets/83tcx8psxv/1

Value of the data
• The data are useful for developing tools to guide educational processes, particularly at the levels of secondary and professional education. This is possible because the configuration of the data set allows analysis of the relative contribution of each variable, as well as the influence that one variable has on others, for example, the influence that the university or college has on the final score.
• The student evaluation scores are useful for performing efficiency analyses that consider both high schools and universities as Decision Making Units (DMUs).
• The variables in the dataset are suitable for building prediction, classification, and evaluation models of academic and social variables.
• Social variables such as socioeconomic status are useful for understanding their influence on students' test results; in addition, the gender distribution by career could be used to analyze the situation of women in Engineering in Colombia.
• No public common student ID exists that would allow the databases to be merged. For that purpose, a formal request was presented to the Colombian Institute for Assessment of Quality Education to provide the linking ID for each student's records for both the high school and university scores on the national standardized tests. In addition, a careful cleaning and debugging process was performed to guarantee the anonymity of the records.

Data description
The data set contains 12,411 observations, each representing a student, and 44 variables. The variables correspond to the student's personal information (categorical) and the results obtained in the assessments (numerical). The academic assessment is recorded at two moments of the student's life. The first is the national standardized test taken in the final year of high school (Saber 11), which evaluates five generic academic competencies. Mathematics (MAT_S11) assesses the students' skills in facing situations that can be resolved with mathematical tools. Critical Reading (CR_S11) assesses the skills needed to understand, interpret, and evaluate texts found in everyday life and in non-specialized academic contexts. Citizen Competencies (CC_S11) assesses the knowledge and skills that allow students to understand the social world from the perspective of the social sciences and to use this understanding as a reference in exercising their role as citizens. Biology (BIO_S11) assesses the student's ability to explain how some natural phenomena occur, based on observations, patterns, and concepts of scientific knowledge. English (ENG_S11) assesses the competence to communicate effectively in English.
The second moment of academic assessment is the final year of the professional career in Engineering, recorded in the national standardized test for higher education (Saber Pro). As in the Saber 11 test, five generic academic competencies are assessed. Critical Reading (CR_PRO) assesses the ability to understand a text, either locally or globally, and to approach it critically. Quantitative Reasoning (QR_PRO) assesses the ability to understand and manipulate quantitative data in different representations, whether tables, graphs, or diagrams. Citizen Competencies (CC_PRO) assesses the concept of citizenship and inclusive coexistence within the framework proposed by the Colombian constitution. Written Communication (WC_PRO) assesses the student's ability to convey in writing their ideas on a topic. English (ENG_PRO) assesses the competence to communicate effectively in English.
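Because the two tests are scored separately, one hedged way to look at the evolution of a competency assessed at both moments is to standardize each score within its own test and compare the standardized values. The sketch below is only an illustration of that idea, not the authors' method: the column codes are assumed to follow the _S11 and _PRO suffix pattern and should be checked against the downloaded file, and the three student records are invented.

```python
from statistics import mean, pstdev

# Assumed column codes for competencies assessed at both moments.
PAIRED = {
    "Critical Reading": ("CR_S11", "CR_PRO"),
    "Citizen Competencies": ("CC_S11", "CC_PRO"),
    "English": ("ENG_S11", "ENG_PRO"),
}


def z_scores(values):
    """Standardize scores to mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All scores identical: no student is above or below the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def evolution(rows, before, after):
    """Per-student change in standardized score between two columns."""
    z_before = z_scores([row[before] for row in rows])
    z_after = z_scores([row[after] for row in rows])
    return [a - b for a, b in zip(z_after, z_before)]


# Invented records, for illustration only (not taken from the dataset).
students = [
    {"CR_S11": 50, "CR_PRO": 60},
    {"CR_S11": 60, "CR_PRO": 100},
    {"CR_S11": 70, "CR_PRO": 80},
]
before, after = PAIRED["Critical Reading"]
change = evolution(students, before, after)
```

A positive value in change means the student moved up relative to the group between the two tests, and a negative value means the student moved down; it says nothing about absolute learning gains.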
The students' personal information was provided by the students themselves when registering for the exam. For example, the socioeconomic level variable in Colombia is related to the neighbourhood where the student lives. The variable 'sisben' refers to the economic aid program that the Colombian government grants to low-income families to improve their quality of life. The variables Internet, TV, Computer, WASHING_MCH, MIC_OVEN, CAR, DVD, FRESH, PHONE and MOBILE indicate whether the student's home has the corresponding service or appliance, with answer categories Yes / No.
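The Yes / No household variables can be turned into booleans before analysis, for instance to count how many services or appliances a home has. The following is a minimal sketch under stated assumptions: the variable names are copied from the list above, the exact spelling and capitalization in the file may differ, and the sample record is invented.

```python
# Household variables as listed in the text (spelling in the file may differ).
HOUSEHOLD_VARS = [
    "Internet", "TV", "Computer", "WASHING_MCH", "MIC_OVEN",
    "CAR", "DVD", "FRESH", "PHONE", "MOBILE",
]


def to_bool(answer):
    """Convert a Yes / No answer to True / False, ignoring case and spaces."""
    text = answer.strip().lower()
    if text == "yes":
        return True
    if text == "no":
        return False
    raise ValueError(f"unexpected answer: {answer!r}")


def household_count(row):
    """Number of listed services or appliances present in the home."""
    return sum(to_bool(row[name]) for name in HOUSEHOLD_VARS)


# Invented record, for illustration only.
sample = {
    "Internet": "Yes", "TV": "Yes", "Computer": "No", "WASHING_MCH": "Yes",
    "MIC_OVEN": "No", "CAR": "No", "DVD": "No", "FRESH": "Yes",
    "PHONE": "No", "MOBILE": "Yes",
}
n_items = household_count(sample)
```

Raising an error on any answer other than Yes or No is a deliberate choice here, so that unexpected categories surface early instead of being silently counted as absent.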
The data can be accessed in the Mendeley Data repository and downloaded in xlsx spreadsheet format. The data set has 12,411 rows, each corresponding to a student, and 44 variables.
Of the students, 5043 (40.63%) are women and 7368 (59.37%) are men. Tables 1 and 2 describe the dataset in more detail. Table 1 shows the numerical variables: the first column gives the coded name of the variable, the second column its original name, the third column the overall mean, the fourth column the standard deviation, and the fifth and sixth columns the maximum and minimum, respectively. Table 2 shows the categorical variables: the first column gives the coded name of the variable, the second column its original name, and the third column the levels or categories of the variable. Table 3 summarizes some variables of the data set for each academic program: the first column gives the name of the academic program, the second and third columns the percentages of women and men in the program, the fourth and fifth columns the percentages of students who come from a public and a private school, the sixth column the average result of the Engineering Project Formulation variable (FEP_PRO), and the last column the overall average score of the professional evaluation (G_SC).
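The percentages quoted above follow directly from the counts (5043 + 7368 = 12,411), and the summary columns of Table 1 can be computed in the same spirit. The sketch below reproduces the gender shares from the reported counts and shows one possible summary function; the labels "F" and "M" are assumptions about how gender is coded, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from collections import Counter
from statistics import mean, stdev


def share_table(labels):
    """Count each category and give its share of the total in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}


def describe(values):
    """Mean, sample standard deviation, maximum and minimum of a variable."""
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "max": max(values),
        "min": min(values),
    }


# Rebuild a gender column from the counts reported in the text
# ("F" and "M" are assumed labels).
genders = ["F"] * 5043 + ["M"] * 7368
shares = share_table(genders)

# Invented scores, for illustration only.
summary = describe([40, 50, 60])
```

The same share_table function could be applied within each academic program to obtain the gender and school-type percentages summarized in Table 3.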

Experimental design, materials, and methods
To build the database, we first needed the list of crossings relating each student's secondary test code to their professional test code. We then downloaded the databases of both tests for the years indicated by the codes (for example, SB2006XX crossed with EK2018XX indicates year 2006 for the secondary test and year 2018 for the professional test of the student). From each database, the variables of interest were extracted and a filter was applied to keep the Engineering programs analyzed in the study. After filtering, the two databases were joined into a single document through the crossings, and the data were then encoded and cleaned into the desired format. The format was chosen to make the flow of information from the secondary test results to the professional test results easy to identify, and to allow easy interpretation and manipulation of the data. The data were manipulated with the tidyr [1] and dplyr [2] libraries of the R software [3].
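The crossing step can be pictured as a join on a list of code pairs. The authors worked in R with tidyr and dplyr; the sketch below uses plain Python instead and is not their implementation. The field names COD_S11, COD_SPRO and ACADEMIC_PROGRAM, the record codes, and the program names are hypothetical, and the year position inside a code is inferred only from the SB2006XX / EK2018XX example.

```python
def year_of(code):
    """Read the four-digit year that follows the two-letter prefix of a code."""
    return int(code[2:6])


def build_dataset(crossings, secondary, professional, programs):
    """Join secondary and professional records through (S11, Pro) code pairs."""
    sec_by_code = {row["COD_S11"]: row for row in secondary}
    pro_by_code = {row["COD_SPRO"]: row for row in professional}
    merged = []
    for s11_code, pro_code in crossings:
        sec = sec_by_code.get(s11_code)
        pro = pro_by_code.get(pro_code)
        if sec is None or pro is None:
            continue  # one side of the crossing is missing
        if pro["ACADEMIC_PROGRAM"] not in programs:
            continue  # keep only the programs under study
        merged.append({**sec, **pro})
    return merged


# Invented records and codes, for illustration only.
secondary = [
    {"COD_S11": "SB2006-A", "MAT_S11": 70},
    {"COD_S11": "SB2006-B", "MAT_S11": 55},
]
professional = [
    {"COD_SPRO": "EK2018-A", "ACADEMIC_PROGRAM": "CIVIL ENGINEERING", "G_SC": 180},
    {"COD_SPRO": "EK2018-B", "ACADEMIC_PROGRAM": "LAW", "G_SC": 150},
]
crossings = [("SB2006-A", "EK2018-A"), ("SB2006-B", "EK2018-B")]
dataset = build_dataset(crossings, secondary, professional, {"CIVIL ENGINEERING"})
```

Indexing both databases by code before looping keeps the join linear in the number of crossings, which matters little at 12,411 rows but mirrors what a keyed join in dplyr does.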