In wenchi0922/PISA2012_: A data analysis task on PISA2012

Abstract

In this research, I explore whether mathematical literacy scores differ within Australia, and how education resources (for example, a shortage of math teachers) affect the assessment results. The analysis compares schools in remote and city areas, and private and public schools, using the PISA2012 mathematics literacy survey data and statistical methods such as linear regression and ANOVA. I also break down the relationship between school location and math literacy scores, and the relationship between school sector (private or public) and students' scores in math literacy.

Introduction

PISA2012: PISA stands for Programme for International Student Assessment, a survey that randomly selects 15-year-old students (students who are attending secondary school) as the sample for assessment. In 2012, a total of 65 countries and economies and about half a million 15-year-old students participated in the PISA assessment[@report]. Results are reported by proficiency level: students at Level 5 or above are considered high performers, whereas students below the international baseline, Level 2, are considered low performers. In 2012, 775 Australian schools and 14,481 students participated in this assessment[@aupisa]. To keep the sample representative, a number of Indigenous students were also sampled.
Students are categorized into 6 levels according to their proficiency in the assessment[@aupisa]:
Level 6: students who score higher than 669.3.
Level 5: students who score higher than 607.0, up to 669.3.
Level 4: students who score higher than 544.7, up to 607.0.
Level 3: students who score higher than 482.4, up to 544.7.
Level 2: students who score higher than 420.1, up to 482.4.
Level 1: students who score higher than 357.8, up to 420.1.
Below Level 1: students who score 357.8 or lower. They do not demonstrate even the most basic types of mathematical literacy that PISA measures, and are likely to be seriously disadvantaged in their lives beyond school.
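The cut-offs listed above can be turned into a small lookup. The following is a minimal sketch in base R; the helper name `proficiency_level` is hypothetical and is not part of PISA2012lite, and the only inputs are the cut-off scores quoted in the list.

```r
# Lower bounds of Level 1 to Level 6, as listed above.
level_cutoffs <- c(357.8, 420.1, 482.4, 544.7, 607.0, 669.3)
level_labels  <- c("Below 1", paste("Level", 1:6))

# Hypothetical helper: map a math score to its proficiency level.
# left.open = TRUE means a score must be strictly higher than a
# cut-off to reach that level.
proficiency_level <- function(score) {
  level_labels[findInterval(score, level_cutoffs, left.open = TRUE) + 1]
}

proficiency_level(c(300, 495.5, 700))
```

A score of 495.5 lies between 482.4 and 544.7, so the helper places it in Level 3.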
Mathematical Literacy: in the mathematical literacy domain, the assessment focuses on students' ability to solve mathematical problems described in a real-life situation. The PISA2012 framework defines mathematical literacy as follows: "Mathematical literacy is an individual's capacity to formulate, employ and interpret mathematics in a variety of contexts. It includes reasoning mathematically and using mathematical concepts, procedures, facts and tools to describe, explain and predict phenomena. It assists individuals to recognise the role that mathematics plays in the world and to make the well-founded judgments and decisions needed by constructive, engaged and reflective citizens."

The assessment in math literacy is designed according to three main components:

the context of a challenge or problem that arises in the real world
the nature of mathematical thought and action that can be used to solve the problem
the processes that the problem solver can use to construct a solution.

Overview of Australian Education System:

--Terms and definition

The PISA handbook[@pisamanual] classifies school location by the population of the community in which the school is located:
Village School or Rural Area: fewer than 3,000 people
Small Town School: 3,000 to about 15,000 people
Town School: 15,000 to about 100,000 people
City School: 100,000 to about 1,000,000 people, for example, Hobart, Tasmania
Large City School: over 1,000,000 people, for example, Sydney and Melbourne

--Facts of Australian Education[@review], [@edtech]

In 2016, there were over 9,400 schools across Australia, of which about 1,400 were secondary schools. The 775 schools that took part in the PISA2012 assessment therefore correspond to more than half of that number of secondary schools.
In 2016, there were over 6,000 government (public) schools, more than 1,700 Catholic schools (private) and more than 1,000 independent schools (private).
In 2017, around 65% of students attended government schools, 19% attended Catholic schools and 16% attended independent schools.
In 2014, the proportion of residents aged 25 to 34 who held a degree was: Major City 42.2%, Inner Regional 21.8%, Outer Regional 19.5%, and Remote and Very Remote 17.8%.

Methods Used

-- Libraries used

ggplot2: for plot generation [@ggplot]
PISA2012lite: provides original data set [@PISA2012]
lme4: for linear regression analysis [@lme4]
magrittr: for the pipe operator %>% [@mag]
data.table: enables fast and efficient data handling [@table]
dplyr: for specific functions, for example the "select" function [@dplyr]

-- PISA2012 data

In this analysis, the original dataset is provided by the PISA2012lite package [@PISA2012]. The PISA2012lite dataset contains 10 data tables, including the results of the school questionnaire and the parent questionnaire, plus the student questionnaire and assessment results. The Australian data cover 775 schools and 14,481 participating students. I used the 5 plausible values (PV1MATH to PV5MATH) as the measure of a student's mathematical literacy. The minimum score in this assessment is 0 and the maximum is 1,000. For reliability, smaller states and Indigenous students were oversampled in this assessment.
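How the plausible values are combined into one score per school can be sketched on a small made-up table. This is a minimal illustration in base R, not the analysis code itself: the scores and weights are invented, only two plausible values are used instead of five, and the column names simply mirror the PISA ones (`SCHOOLID`, `W_FSTUWT`, `PV1MATH`, `PV2MATH`).

```r
# Hypothetical toy data: two schools, two students each.
toy <- data.frame(
  SCHOOLID = c("A", "A", "B", "B"),
  W_FSTUWT = c(1, 3, 2, 2),          # made-up student weights
  PV1MATH  = c(400, 500, 450, 550),  # made-up plausible values
  PV2MATH  = c(420, 480, 470, 530)
)
pv_cols <- c("PV1MATH", "PV2MATH")

# For each school: weighted mean of every plausible value,
# then the plain average across the plausible values.
school_means <- sapply(split(toy, toy$SCHOOLID), function(d) {
  per_pv <- sapply(d[pv_cols], weighted.mean, w = d$W_FSTUWT)
  mean(per_pv)
})
school_means
```

For school A the weighted means are 475 and 465, so its combined score is 470; school B comes out at 500 because its two students carry equal weight.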

-- Statistical Test Used:

Since the data to be analyzed contain more than 30 samples, we will run an F-test instead of a t-test. Before starting the statistical testing, we have to check that the data are approximately normally distributed so that the F-test is appropriate.

Linear Regression Analysis: linear regression is one of the most popular statistical methods. It models the relationship between a dependent variable (Y) and one or more explanatory variables (X).
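Because the explanatory variables in this analysis are categorical, it helps to see what `lm` estimates in that case. The sketch below uses invented scores for two made-up groups (it is not PISA data): the intercept is the mean of the reference group and the other coefficient is the difference between the group means.

```r
# Hypothetical toy data: school mean scores for two location groups.
toy <- data.frame(
  location = factor(c("Village", "Village", "City", "City"),
                    levels = c("Village", "City")),  # Village is the reference level
  score = c(440, 460, 510, 530)
)

fit <- lm(score ~ location, data = toy)
coef(fit)  # intercept = Village mean, locationCity = City mean minus Village mean
```

Here the Village mean is 450 and the City mean is 520, so the two coefficients are 450 and 70.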

ANOVA Test: ANOVA stands for Analysis of Variance. It is used to analyze the differences between the means of two or more groups.

Tukey Test: the Tukey test is often used after an ANOVA test. It allows researchers to compare the means of all possible pairs of groups.
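The ANOVA and Tukey steps can be tried on a dataset that ships with R before applying them to the PISA tables. This sketch uses `PlantGrowth` (from the built-in datasets package) purely as a stand-in with one numeric response and one three-level factor; it says nothing about the PISA results.

```r
# One-way ANOVA: a single F-test of "are all group means equal?"
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)

# Tukey's honest significant differences: every pair of groups,
# with confidence intervals and p-values adjusted for multiple comparisons.
tuk <- TukeyHSD(fit_aov)
tuk$group  # matrix with columns diff, lwr, upr and "p adj"
```

With three groups there are three possible pairs, so the Tukey table has three rows.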

-- Hypothesis:

H0 (null hypothesis): the variables considered do not affect the students' results in the PISA 2012 mathematical literacy assessment.

H1 (alternative hypothesis): school location, education resources and whether the school is privately or publicly owned affect the students' performance in mathematical literacy.

Results

library(data.table)
library(dplyr)
library(ggplot2)
library(PISA2012lite)
library(lme4)
library(magrittr)
data("school2012")
setDT(school2012)
data("computerStudent2012")
setDT(computerStudent2012)

#calculate the weighted average of the Math Literacy performance using PV1MATH to PV5MATH
pv_cols <- paste0("PV", 1:5, "MATH")
student_data <- computerStudent2012[
  NC == "Australia",
  lapply(.SD, weighted.mean, w = W_FSTUWT / sum(W_FSTUWT)),
  by = SCHOOLID, .SDcols = c(pv_cols, "ESCS")
][, .(SCHOOLID, ESCS, Mean_PVMATH = Reduce(`+`, .SD) / length(.SD)),
  .SDcols = pv_cols]
#a summary table showing the overall mean math score
student_data[, .(mu = mean(Mean_PVMATH, na.rm = TRUE), sigma = sd(Mean_PVMATH, na.rm = TRUE))]

#test whether the data is normally distributed
#' Put data onto a standard normal scale
#' @param x A numeric vector
#' @return A numeric vector of same length as `x`. This vector has mean 0 and sd 1.
standardise <- function(x) {
  (x - mean(x)) / sd(x)
}

#a plot test for normality
ggplot(student_data) + stat_qq(aes(sample = Mean_PVMATH %>% standardise)) + geom_abline()

As we can see from the plot above, the points lie close to the reference line, so the column Mean_PVMATH is approximately normally distributed. We can therefore proceed to the statistical tests in the following paragraphs.
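The reason for standardising before drawing the Q-Q plot is that `geom_abline()` defaults to the line y = x, which is only a sensible reference for data with mean 0 and standard deviation 1. The sketch below checks this on simulated scores (not PISA data). The Shapiro-Wilk test at the end is an addition for illustration and is not part of the original analysis.

```r
# Same helper as in the analysis code.
standardise <- function(x) {
  (x - mean(x)) / sd(x)
}

set.seed(1)
x <- rnorm(200, mean = 495, sd = 50)  # simulated scores, an assumption for this sketch
z <- standardise(x)

c(mean = mean(z), sd = sd(z))  # mean is 0 up to rounding error, sd is 1
shapiro.test(z)$p.value        # optional formal check; a large p-value gives no evidence against normality
```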

The following graph shows the number of private and public schools in Australia that participated:

#plot a bar chart showing the count of private and public schools participated
ggplot(school2012[NC == "Australia" & !is.na(SC01Q01)], aes(x=SC01Q01))+geom_bar(fill="#E69F00", colour="black", stat="count")+labs(x="Private or Public", y="Total Count")

The following graph shows the distribution of school locations among the Australian schools that participated:

#plot bar chart showing the school location for schools participated:
ggplot(school2012[NC == "Australia" & !is.na(SC03Q01)], aes(SC03Q01))+geom_bar(fill="#E69F00", colour="black")+labs(x="School Location", y="Total Count")

The following plot relates students' socio-economic status (ESCS) to Mean_PVMATH:

#a point plot showing the relationship between socio-economic state and math performance
ggplot(data = student_data, aes(x = ESCS, y = Mean_PVMATH)) +
  geom_point(aes(colour = SCHOOLID)) + geom_smooth(fill="black", colour="darkblue", size=1)+ theme(legend.position="none")+labs(x="Socio-economic state", y="Math Performance")

The graph shows a positive relationship between the school-average socio-economic status of students and their math performance.

#create a view selecting country Australia and column SC14Q02 (Lack of math teacher)
school_view <- school2012[NC =="Australia" & !is.na(SC14Q02)]
school_plot_data <- select(school_view, SC14Q02, SCHOOLID)

#merge two tables into one table for plotting use
plot_data <- merge(student_data, school_plot_data, by="SCHOOLID")

#a simple summary table
plot_data[, .(mu = mean(Mean_PVMATH, na.rm = T),
              sigma = sd(Mean_PVMATH, na.rm = T)),
              by = SC14Q02]

#create a box plot
ggplot(plot_data, aes(x=factor(SC14Q02), y=Mean_PVMATH))+geom_boxplot()+labs(x="Lack of Math Teacher", y="Math Performance")

#run linear regression test
fit <- lm(Mean_PVMATH~SC14Q02, data=plot_data)
summary(fit)

#run ANOVA test
plot_anova <- aov(Mean_PVMATH~SC14Q02, data = plot_data)
summary(plot_anova)

#run Tukey test
tuk <- TukeyHSD(plot_anova)
tuk
plot(tuk)

The boxplot suggests that mean scores differ across schools with different levels of math teacher shortage, and that the shortage is strongly associated with the students' math assessment results.

Now we want to look into the relationship between school location and the weighted math performance from PV1MATH to PV5MATH:

#filter out the data from Australia and select column SC03Q01(School location) and SchoolID
school_view <- school2012[NC =="Australia" & !is.na(SC03Q01)]
school_plot_data <- select(school_view, SC03Q01, SCHOOLID)

#merge two tables into one table for plotting use
plot_data <- merge(student_data, school_plot_data, by="SCHOOLID")

#a simple summary table
plot_data[, .(mu = mean(Mean_PVMATH, na.rm = T),
              sigma = sd(Mean_PVMATH, na.rm = T)),
              by = SC03Q01]

#create a boxplot of math performance by school location from the merged table
ggplot(plot_data, aes(x=factor(SC03Q01), y=Mean_PVMATH))+geom_boxplot()+labs(x="School Location", y="Math Performance")

#run linear regression test
fit <- lm(Mean_PVMATH~SC03Q01, data=plot_data)
summary(fit)

#run ANOVA test
plot_anova <- aov(Mean_PVMATH~SC03Q01, data = plot_data)
summary(plot_anova)

#run Tukey test
tuk <- TukeyHSD(plot_anova)
tuk
plot(tuk)

The boxplot shows a clear difference in the mean scores of students from different regions of Australia. The graph implies that students from large cities in Australia, for example Sydney and Melbourne, normally score higher than students from a village. The pairs that are significantly different according to the Tukey test are those whose confidence intervals do not cross the 0 value.
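The rule for reading the Tukey plot (an interval that does not cross 0 marks a significant pair) can also be applied to the Tukey table directly. The sketch below again uses the built-in `PlantGrowth` data as a stand-in, not the PISA results, and compares the interval rule with the adjusted p-values at the 0.05 level.

```r
tuk <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
res <- tuk$group

# A pair is significant when its 95% interval lies entirely above or below 0.
excludes_zero <- res[, "lwr"] > 0 | res[, "upr"] < 0
# The same decision read from the adjusted p-value.
significant <- res[, "p adj"] < 0.05

# 1 = TRUE, 0 = FALSE in the last two columns.
cbind(round(res, 3), excludes_zero, significant)
```

Both columns flag the same pairs, which is why the plot and the `p adj` column lead to the same conclusions.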

school_view <- school2012[NC =="Australia" & !is.na(SC01Q01)]
school_plot_data <- select(school_view, SC01Q01, SCHOOLID)

#merge two tables into one table for plotting use
plot_data <- merge(student_data, school_plot_data, by="SCHOOLID")

#a simple summary table
plot_data[, .(mu = mean(Mean_PVMATH, na.rm = T),
              sigma = sd(Mean_PVMATH, na.rm = T)),
              by = SC01Q01]

#create a boxplot comparing the mean value
ggplot(plot_data, aes(x=factor(SC01Q01), y=Mean_PVMATH))+geom_boxplot()+labs(x="Private or Public", y="Math Performance")

#run linear regression test
fit <- lm(Mean_PVMATH~SC01Q01, data=plot_data)
summary(fit)

#run ANOVA test
plot_anova <- aov(Mean_PVMATH~SC01Q01, data = plot_data)
summary(plot_anova)

The boxplot shows a large difference in the mean scores of students from different school sectors in Australia. The graph implies that students from private schools in Australia outperform students from government/public schools.

Discussion

The summary shows that the mean score for all Australian students is 495.5 and that more public schools than private schools participated.

A student's socio-economic status is positively associated with their mathematical literacy assessment result.

A statistical look into the shortage of math teachers shows that students from schools with no shortage of math teachers at all have the highest mean score, 514.80, whereas the mean score of students from schools that lack a lot of math teachers is 475.73. The linear regression shows that a shortage of math teachers is strongly associated with the students' math assessment results, since the p-value is very small (<0.05).

From the Tukey test, we can conclude that:

There is no significant difference in performance for the pairs A lot-Very little and A lot-To some extent shortage of math teachers (since the p-value is greater than 0.05).
There is a significant difference for the pairs To some extent-Not at all, Very little-Not at all and A lot-Not at all shortage of math teachers (since the p-value is equal to 0 or almost equal to 0).

According to the school location analysis, there is a positive relationship between the size of the community where the school is located and the students' math literacy scores. The mean score for all Australian students is 495.5, whereas the mean score for students from a village is 448.41 and the mean score for students from a large city is 518.59. The difference between these mean scores is (518.59-448.41)=70.18 points.

In the linear regression for school location, the p-values are all smaller than 0.05, which implies that school location is strongly related to students' math performance. In the ANOVA test, the F value is 29.76 and the p-value is very small (less than 0.05), which also implies that there is a significant relationship between school location and students' math literacy scores. Since this factor has more than 2 levels, we also run a Tukey test to compare the differences between each pair of groups.

From the columns diff and p adj of the Tukey test above, we can conclude that:

There is no significant difference in performance between a Town School and a Small Town School, since p=0.55>0.05, or between Small Town and Village, since p=0.07>0.05.
There is a significant difference in math performance for City-Village, Large City-Village, Large City-Small Town and Large City-Town, since the p-values are all equal to 0. These significant differences are shown in the plotted Tukey graph.

According to the private/public analysis, whether a student attends a private or a public/government school is a factor related to their PISA mathematical literacy result. While the mean score for all Australian students is 495.5, the mean score for students from the independent/private sector is 521.067 and the mean score for students who attend public/government schools is 479.2. The difference is (521.067-479.2)=41.87 points. In the linear regression for the private/public comparison, the p-values are all smaller than 0.05, which implies that whether the school is private or public is strongly related to students' math performance. In the ANOVA test, the F value is 104.9 and the p-value is very small (less than 0.05), suggesting a strong relationship between school sector and students' math literacy scores. Since this factor has only two levels, we do not run a Tukey test as above.

The evidence shows that whether the school is privately or publicly owned, the school location and education resources are all related to students' mathematical literacy scores in PISA 2012. We therefore reject H0 in favour of H1: whether a school is privately or publicly owned, the location of the school and its education resources affect students' results in the PISA2012 mathematical literacy assessment.

Conclusion

A few conclusions can be drawn from the analysis:

Students from private/independent schools significantly outperform students from public/government schools.
Students from large cities outperform students from other locations.
Students with a higher socio-economic status also show a higher level of proficiency in math literacy.
On average, Australian students fall in Level 3 of proficiency.
Although the mean score for large city students is the highest among the locations, the standard deviation is the largest too, implying that the scores of large city students are more widely spread than those of other locations.
Since students from higher socio-economic families are more likely to enter a private school, this group of students may contribute to the higher mathematical literacy performance of private schools.
A shortage of math teachers is also associated with lower math performance. Interestingly, once a school has any shortage of math teachers, the level of shortage does not make a great difference to performance. For example, there is no great difference between the mean scores of students from schools facing very little shortage, a shortage to some extent and a lot of shortage of math teachers.

To enable deeper investigation in the future, I think it is necessary to incorporate data from government education sources and to analyze how education resources and funds are distributed across the nation. This report can serve as evidence for future analysis of equity in education.

Nevertheless, I think the difference in students' math performance cannot serve as evidence that students from a village or remote area are "poorly" educated. Since the assessment is standardized, it cannot evaluate intangible attributes of a student, such as their EQ and their values. As far as I know, math may be comparatively less relevant to the daily lives of students who come from Indigenous families and live in the outback; an assessment of practical skills such as hunting may seem more crucial to them.