I decided to look at the College Scorecard data, focusing on the columns that have information about first-generation college students. As a first-generation student of color, I have noticed that many other first-generation students take time off and sometimes do not return to Bowdoin. This observation inspired me to look at the completion rates of first-generation students at various types of institutions. I chose to look at colleges and universities whose highest degree awarded is a bachelor's degree. Restricting the data to four-year institutions narrows it down and removes vocational institutions and institutions that only award associate's degrees. Given that many students do not graduate within four years, I look at how likely first-generation students are to graduate within six years.
First-generation students have to overcome many obstacles when transitioning into college. Since many first-generation students come from disadvantaged backgrounds, it is often difficult for them to complete college within four years. For low-income first-generation students, moving into college can be quite the culture shock. The demands from home and the financial constraints are often difficult to balance in college. These added stresses can lead first-generation students to feel distressed, feel like they do not belong, and give up on school altogether. The majority of students' stress comes from financial constraints, so some studies have predicted that making college more affordable could help increase the retention and completion rates among first-generation students. Despite this suggestion, first-generation students continue to face disadvantages that prevent them from completing college.
What states have the lowest rates of first-generation students that graduate with a bachelor's degree within six years?
I think it is logical to assume that bigger states such as Texas and California will have more first-generation students because they have a bigger pool of college-age students to look at. Having a bigger pool of students means that it is more difficult to have higher completion rates when compared to other states. My prediction is that big states like Texas and California, along with the border states of New Mexico and Arizona, will have higher rates of first-generation students graduating with a bachelor's degree within six years. On a similar note, I feel that smaller states such as North Dakota, the District of Columbia, Vermont, and Connecticut will have lower completion rates because they have a smaller population.
\1. The code below loads the packages that I will be using in my notebook. This is especially important for my visuals and for merging the College Scorecard data with the states data.
library(ggplot2)
library(maps)
library(RColorBrewer)
library(rgdal)
library(sp)
library(rgeos)
library(maptools)
\2. The code below creates a data frame called states from the state outlines in the maps data and then shows us a table of the first six rows in the map data.
states <- map_data("state")
head(states)
\3. I created a data frame called csc by loading a new spreadsheet I created (saved as a CSV file) that contains the following column variables:
INSTNM = Institution Name
Region = Abbreviated State Name
CONTROL = 1 for a Public School, 2 for a Private nonprofit, 3 for a Private for-profit School
LATITUDE
LONGITUDE
UGDS_HISP = Total Share of Enrollment of Undergraduate Degree-Seeking Students who are Hispanic
FIRSTGEN_COMP_ORIG_YR6_RT = Percent of First-Generation Students who Completed Within 6 Years at Original Institution
FIRST_GEN = Share/ Percentage of First-Generation Students
HIGHDEG = 1 for a Certificate, 2 for an Associate's Degree, 3 for a Bachelor's Degree, and 4 for a Graduate Degree
REGION2 = 1 for New England (CT, ME, MA, NH, RI, VT), 2 Mid East (DE, DC, MD, NJ, NY, PA), 3 Great Lakes (IL, IN, MI, OH, WI), 4 Plains (IA, KS, MN, MO, NE, ND, SD), 5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV), 6 Southwest (AZ, NM, OK, TX), 7 Rocky Mountains (CO, ID, MT, UT, WY), 8 Far West (AK, CA, HI, NV, OR, WA), 9 Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)
csc <- read.csv("College_Data_FirstGen.csv", header = TRUE, stringsAsFactors = FALSE)
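College Scorecard files typically mark withheld or missing values with strings such as "PrivacySuppressed" or "NULL", which makes read.csv treat otherwise numeric columns as text. The sketch below is only an illustration: it assumes those two markers and uses a small invented inline table (toy_csv) in place of College_Data_FirstGen.csv. It shows how the na.strings argument turns the markers into NA at load time, so the columns arrive as numbers.

```r
# Toy stand-in for the real CSV; the rows and values are invented for illustration.
toy_csv <- "INSTNM,STABBR,FIRST_GEN,FIRSTGEN_COMP_ORIG_YR6_RT
College A,TX,0.41,0.35
College B,TX,PrivacySuppressed,NULL
College C,ME,0.12,0.81"

# na.strings lists every text marker that should be read as a missing value.
toy <- read.csv(text = toy_csv, header = TRUE, stringsAsFactors = FALSE,
                na.strings = c("NA", "NULL", "PrivacySuppressed"))

# Both rate columns are now numeric, with NA where a marker appeared.
str(toy)
```

If the real file uses different markers, they would need to be added to na.strings; the later as.numeric() calls in this notebook then become unnecessary but remain harmless.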
\4. The code below defines a function that turns abbreviated state names into lowercase full state names so that they can match the "region" column in the map data.
# 'x' is the column of a data.frame that holds two-letter state codes
stateFromLower <- function(x) {
  # read 52 state codes into a local variable (includes DC (Washington, D.C.) and PR (Puerto Rico))
  st.codes <- data.frame(
    state = as.factor(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
                        "HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
                        "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
                        "NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
                        "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")),
    full = as.factor(c("alaska", "alabama", "arkansas", "arizona", "california", "colorado",
                       "connecticut", "district of columbia", "delaware", "florida", "georgia",
                       "hawaii", "iowa", "idaho", "illinois", "indiana", "kansas", "kentucky",
                       "louisiana", "massachusetts", "maryland", "maine", "michigan", "minnesota",
                       "missouri", "mississippi", "montana", "north carolina", "north dakota",
                       "nebraska", "new hampshire", "new jersey", "new mexico", "nevada",
                       "new york", "ohio", "oklahoma", "oregon", "pennsylvania", "puerto rico",
                       "rhode island", "south carolina", "south dakota", "tennessee", "texas",
                       "utah", "virginia", "vermont", "washington", "wisconsin",
                       "west virginia", "wyoming"))
  )
  # create an n x 1 data.frame of state codes from the source column
  st.x <- data.frame(state = x)
  # match source codes with codes from the 'st.codes' local variable to look up the full state name
  refac.x <- st.codes$full[match(st.x$state, st.codes$state)]
  # return the full state names in the same order in which they appeared in the original source
  return(refac.x)
}
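The same lookup can be written more compactly with R's built-in state.abb and state.name vectors. This is a sketch rather than a replacement for the function above: abbr_to_name is a hypothetical name, the built-in vectors cover only the 50 states (so DC and Puerto Rico are appended by hand), and the result is a character vector instead of a factor.

```r
# state.abb and state.name are built-in R vectors covering the 50 states only,
# so DC and Puerto Rico are appended by hand.
abbr_to_name <- function(x) {
  codes <- c(state.abb, "DC", "PR")
  names_lower <- tolower(c(state.name, "District of Columbia", "Puerto Rico"))
  # match() returns NA for codes that are not in the lookup (for example "GU")
  names_lower[match(x, codes)]
}

abbr_to_name(c("TX", "ME", "DC", "GU"))
# "texas" "maine" "district of columbia" NA
```

Codes for the outlying areas come back as NA with either version, which matters later because those institutions have no polygon in the state map data.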
\5. I created a new column in the csc data called "region" that holds the lowercase names of the states. I then print out the first ten state names in the region column of the csc data.
csc$region <- stateFromLower(csc$STABBR)
csc$region[1:10]
\6. I created a new data frame below called csc_df that merges the csc and states data on their shared region column. I then print out the first six rows of csc_df.
csc_df <- merge(csc, states, by = "region")
head(csc_df)
\7. The code below creates a new data frame called csc2 that subsets the csc data to colleges whose highest degree awarded is a bachelor's degree. The head function prints out the first six rows of the subset of csc.
csc2 <- csc[csc$HIGHDEG == 3,]
head(csc2)
\8. Here I created a logical vector called tx that flags the colleges from csc2 that are in Texas, and a data frame called tx2 that holds the private nonprofit colleges in Texas. The s data frame subsets the Texas rows by only looking at the columns listed below. The first six rows are shown in the table below.
tx <- csc2$region == "texas"
tx2 <- csc2[which(tx & csc2$CONTROL == 2),]
s <- csc2[tx,c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(s)
\9. I created a logical vector called complete that flags the rows with no NAs or non-numeric values in the UGDS_HISP, FIRST_GEN, and FIRSTGEN_COMP_ORIG_YR6_RT columns. I subset s using complete and then print the first six rows to check that I got rid of the non-numeric values in the data.
complete <- complete.cases(cbind(as.numeric(s[,1]), as.numeric(s[,2]), as.numeric(s[,3])))
complete[1:5]
s <- s[complete, c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(s)
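The filtering in this step relies on two facts: as.numeric() turns text that is not a number into NA, and complete.cases() is TRUE only for rows with no NA. A small sketch with invented values (the toy data frame and the num_cols and keep names are assumptions for illustration) makes the behavior visible:

```r
# Invented rows: one suppressed value and one true NA in FIRST_GEN.
toy <- data.frame(
  FIRST_GEN = c("0.41", "PrivacySuppressed", "0.12", NA),
  FIRSTGEN_COMP_ORIG_YR6_RT = c("0.35", "0.50", "0.81", "0.20"),
  INSTNM = c("College A", "College B", "College C", "College D"),
  stringsAsFactors = FALSE
)

num_cols <- c("FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT")

# Convert the rate columns to numbers; suppressWarnings hides the expected
# "NAs introduced by coercion" message for the non-numeric text.
toy[num_cols] <- lapply(toy[num_cols], function(col) suppressWarnings(as.numeric(col)))

# Keep only rows where every rate column has a value.
keep <- complete.cases(toy[num_cols])
toy_clean <- toy[keep, ]

toy_clean$INSTNM
# "College A" "College C"
```

Unlike the step above, this version stores the converted numbers back in the data frame, so later plotting code would not need to call as.numeric() again.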
\10. Here I created a vector called cexVals that sets the point size for every row in the csc2 data: 0.5 by default and 1 for schools in Texas. The pchVals vector sets the point shape: a plus sign by default and a solid circle for Texas schools. The colVals vector sets the point color: medium grey by default and dark grey for Texas schools.
cexVals <- rep(0.5, nrow(csc2))
cexVals[csc2$region == "texas"] = 1
pchVals <- rep(3, nrow(csc2))
pchVals[csc2$region == "texas"] = 19
colVals <- rep(grey(0.5), nrow(csc2))
colVals[csc2$region == "texas"] <- grey(0.1)
\11. Below I created two data frames that subset s, which holds the data for Texas colleges. sub represents public Texas colleges and sub2 represents private for-profit Texas colleges.
sub <- s[s$CONTROL == 1, c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(sub)
sub2 <- s[s$CONTROL == 3, c("UGDS_HISP", "FIRST_GEN", "FIRSTGEN_COMP_ORIG_YR6_RT", "INSTNM", "CONTROL")]
head(sub2)
\12. Using the plot function, I created a scatterplot of the percentage of first-generation students against the percentage of first-generation students that complete a bachelor's degree within six years at private nonprofit colleges in Texas. I use the size, shape, and color established in the code above, label the x- and y-axes accordingly, label the points with the names of the schools in Texas, and draw a line with a slope of one. The points function adds red points for public institutions in Texas and blue points for private for-profit institutions.
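idx <- match(rownames(tx2), rownames(csc2)) # rows of csc2 that were kept in tx2, so the style vectors line up
plot(as.numeric(tx2$FIRST_GEN), as.numeric(tx2$FIRSTGEN_COMP_ORIG_YR6_RT), cex=cexVals[idx], col=colVals[idx], pch=pchVals[idx], xlab="PercFirstGen", ylab="FirstGenComp6yr", main="First-Generation Students in Private Nonprofit Colleges in Texas")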
text(as.numeric(s$FIRST_GEN), as.numeric(s$FIRSTGEN_COMP_ORIG_YR6_RT) + 0.001, labels = s$INSTNM, pos = 1, cex = 0.5) # x and y match the axes of the plot
abline(0,1)
points(sub$FIRST_GEN, sub$FIRSTGEN_COMP_ORIG_YR6_RT, col="red")
points(sub2$FIRST_GEN, sub2$FIRSTGEN_COMP_ORIG_YR6_RT, col="blue")
The scatterplot above shows us that public Texas colleges have the highest percentages of first-generation students, at around 55% and 63%, but completion rates under 20%. Private for-profit Texas colleges also have a high percentage of first-generation students, but they have relatively high completion rates for first-generation students, ranging from 20% to 70%.
\13. The code below creates a logical vector called logic that is TRUE wherever the completion rate is NA. The perc vector then uses the tapply function to average the completion rate by state, leaving out the NAs.
#pg46
logic <- is.na(csc2$FIRSTGEN_COMP_ORIG_YR6_RT)
perc <- tapply(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT[!logic]), INDEX=csc2$region[!logic], FUN=mean, na.rm=TRUE) # the values and the index must have the same length
perc
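To see what tapply does with missing values, here is a small sketch on invented numbers (rates, region, and perc_toy are made-up names and values, not the Scorecard data). It shows why some states can still end up without a usable average even when na.rm = TRUE is set:

```r
# Invented completion rates for five institutions in three states.
rates  <- c(0.30, NA, 0.50, 0.80, NA)
region <- c("texas", "texas", "texas", "maine", "delaware")

# Mean rate per state; na.rm = TRUE drops the NAs inside each group.
perc_toy <- tapply(rates, INDEX = region, FUN = mean, na.rm = TRUE)
perc_toy
# delaware: NaN, maine: 0.8, texas: 0.4
```

A group whose values are all NA has nothing left to average, so its mean is NaN. is.na() is TRUE for NaN, which is why a later cleanup step that tests is.na() still catches such states.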
\14. I created a new data frame called df_perc using the perc vector in the code above. Then I created a column called region in the new data frame that holds the row names of df_perc, and I printed df_perc to see how the data frame looks.
df_perc <- as.data.frame(perc)
df_perc$region <- rownames(df_perc)
df_perc
\15. The logic2 vector below flags the NAs in the perc column of df_perc. Subsetting the perc column by logic2 then changes those NA values to 0.
logic2 <- is.na(df_perc$perc)
df_perc$perc[logic2] <- 0
df_perc
\16. I checked the summary of the variable for the percent of first-generation students who complete college within 6 years. The hist function creates a histogram with twenty breaks, a labeled x-axis, and a title.
summary(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT))
hist(as.numeric(csc2$FIRSTGEN_COMP_ORIG_YR6_RT), breaks=20, xlab= "Percent of First-Gen Students", main="First-Gen Completion Rates Within Six Years")
\17. The histogram above shows the spread of the percentage of first-generation students that graduate from college with a bachelor's degree within 6 years. The spread looks relatively normal. Here is a description of what states are in each region:
1 for New England (CT, ME, MA, NH, RI, VT)
2 Mid East (DE, DC, MD, NJ, NY, PA)
3 Great Lakes (IL, IN, MI, OH, WI)
4 Plains (IA, KS, MN, MO, NE, ND, SD)
5 Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV)
6 Southwest (AZ, NM, OK, TX)
7 Rocky Mountains (CO, ID, MT, UT, WY)
8 Far West (AK, CA, HI, NV, OR, WA)
9 Outlying Areas (AS, FM, GU, MH, MP, PR, PW, VI)
ggplot(csc2, aes(x=factor(REGION2), y=as.numeric(FIRSTGEN_COMP_ORIG_YR6_RT), fill = factor(REGION2))) + geom_bar(stat='identity') +
labs(x="Region") +
labs(y="Count") +
labs(title="Total Number of First-Gen Students Who Complete College in the U.S.")
\18. The bar chart above shows that Region 5 has the tallest bar and Region 9 the shortest. Because each bar stacks the completion rates of every institution in the region, its height reflects how many institutions a region has as well as how high their rates are, so it is not a count of students. Region 5's position is still an interesting observation considering that it contains AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, and WV.
\19. The code below gets rid of any negative values by setting any percentage less than 0 equal to 0. The interval vector then cuts the perc column into four intervals, which are printed below.
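One thing to keep in mind with the regional chart in step 17: geom_bar(stat='identity') stacks every row that shares an x value, so each bar is the sum of institution-level rates. A sketch on invented numbers (the toy data frame, region_sum, and region_mean are assumptions for illustration) shows how a sum and a mean can rank regions differently:

```r
# Invented rates: region 5 has four institutions, region 9 has one.
toy <- data.frame(
  REGION2 = c(5, 5, 5, 5, 9),
  rate    = c(0.30, 0.40, 0.20, 0.30, 0.60)
)

# What a stacked bar shows: the sum of rates per region.
region_sum <- tapply(toy$rate, toy$REGION2, sum)

# The average rate per region, which does not grow with the number of institutions.
region_mean <- tapply(toy$rate, toy$REGION2, mean)

region_sum   # region 5 is larger (1.2 vs 0.6)
region_mean  # region 9 is larger (0.6 vs 0.3)
```

Plotting the per-region mean instead of the stacked rates would be one way to compare regions on completion rate alone; whether that is the better view depends on the question being asked.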
df_perc$perc[df_perc$perc<0] = 0
interval <- unique(cut(df_perc$perc, 4))
interval
\20. The next set of code creates breaks from df_perc$perc with the following labels according to the intervals created above.
df_perc$breaks = cut(df_perc$perc, 4, labels = c("0-.132", ".132-.264", ".264-.396", ".396-.529"))
head(df_perc)
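The labels in this step are typed in by hand from the printed intervals, so they would go stale if the data changed. A possible alternative, sketched here on invented values, is to compute the bin edges first and build the labels from them. The names perc_toy, edges, and bin_labels are hypothetical, and explicit edges are not identical to cut(x, 4), which pads the range slightly before splitting it.

```r
# Invented state averages standing in for df_perc$perc.
perc_toy <- c(0, 0.15, 0.30, 0.45, 0.529)
n_bins <- 4

# Equal-width bin edges from the smallest to the largest value.
edges <- seq(min(perc_toy), max(perc_toy), length.out = n_bins + 1)

# Build "lower-upper" labels from neighbouring edges.
bin_labels <- paste0(round(head(edges, -1), 3), "-", round(tail(edges, -1), 3))

# include.lowest = TRUE keeps the minimum value in the first bin.
breaks_toy <- cut(perc_toy, breaks = edges, labels = bin_labels, include.lowest = TRUE)
table(breaks_toy)
```

With this approach the legend text always matches the bins that were actually used.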
\21. choro_df is created by merging the states data with the df_perc data according to region; the code then prints the first six rows of the data.
choro_df <- merge(states, df_perc, by = "region")
head(choro_df)
\22. Next, choro_df is sorted by its order column and saved as choro, and the first six rows are printed.
choro <- choro_df[order(choro_df$order), ]
head(choro)
\23. After the data is cleaned, we are finally ready to plot the data on a map. I used qplot with the longitude and latitude of the choro data and filled the states according to the breaks created earlier. I create a title using main, I draw a border around each state so that it is easier to find states, and I use the Spectral palette to color the states.
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon",
main = "College Completion Rates for First-Generation Students") + borders("state", size = 0.5) +
scale_fill_brewer(name = "College Completion", palette = "Spectral")
Red = Delaware
Orange = Washington, South Dakota, and Mississippi
Green = Montana, Idaho, Wyoming, North Dakota, Nevada, Utah, Colorado, New Mexico, Texas, Oklahoma, Kansas, Nebraska, Michigan, Maine, New York, Massachusetts, New Jersey, Maryland, Virginia, West Virginia, North Carolina, Tennessee, South Carolina, Georgia, Alabama, and Florida
Blue = Oregon, California, Arizona, Minnesota, Iowa, Missouri, Wisconsin, Illinois, Indiana, Kentucky, Ohio, Pennsylvania, Connecticut, Rhode Island, Vermont, and New Hampshire
I decided to focus my time on analyzing the red and orange states and looking into why these states have rates between 0 and 26%. First-generation students tend to be racial minorities and/or from low-income families, often in single-parent households. These characteristics make it more difficult for first-generation students to complete college. Many first-generation students feel pressure to drop out of school because of family money problems, stress and anxiety, a sense of not belonging, and off-campus employment. It is easier to get to the root of why completion rates for first-generation students are low in general than it is to explain why the rates are especially low in certain states.
After looking closely at my data, I found that Delaware does not have any colleges that give out bachelor's degrees. This could be the main reason why the state is seen to have the lowest rate of first-generation students completing college: because step 15 sets missing state averages to 0, this would place Delaware in the lowest interval. As for the orange states that have completion rates between 13% and 26%, there is enough data in the College Scorecard data for 4-year institutions. In Washington, the Robert B Miller College, with a completion rate of 53%, must have pulled the average up, while Seattle Central College has a completion rate of less than 1% even though first-generation students make up 43% of its student population. In South Dakota, 30% of first-generation students at Presentation College graduate. In Mississippi, one out of three colleges did not release information about the percentage of first-generation students who completed college, and Rust College has the lowest percentage of first-generation students completing college, at 15%. I definitely limited my data by only looking at 4-year institutions, but I think the percentage averages accurately represent each state.
Boyd, Vivian S., Linda K. Gast, Patricia F. Hunt, Alice Mitchell, and Wendy Wilson. "Why Some Students Leave College During Their Senior Year." Journal of College Student Development 53.5 (2012): 737-42. Web.
Riggs, Liz. "First-Generation College-Goers: Unprepared and Behind." The Atlantic, 31 Dec. 2014, http://www.theatlantic.com/education/archive/2014/12/the-added-pressure-faced-by-first-generation-students/384139/. Accessed 7 May 2017.
Wilbur, T. G., and V. J. Roscigno. "First-generation Disadvantage and College Enrollment/Completion." Socius: Sociological Research for a Dynamic World 2.0 (2016): 1-11. Web.
Wolfman-Arent, Avi. "First Year, First Generation: Overwhelmed by demands, buoyed by encouragement." newsworks, 28 Jun. 2016, http://www.newsworks.org/index.php/local/education/94947-first-year-first-generation-seans-spot. Accessed 7 May 2017.
Zinshteyn, Mikhail. "How to Help First-Generation Students Succeed." The Atlantic, 13 Mar. 2016, http://www.theatlantic.com/education/archive/2016/03/how-to-help-first-generation-students-succeed/473502/. Accessed 7 May 2017.