Background and Data Cleaning

Back when I first started getting into programming and data analytics, I would often head to edX (https://www.edx.org/) to take some of their free courses and get a feel for things. Around the same time, I also learned about web scraping and how it could be used to build datasets directly from any website you were interested in (as long as you comply with robots.txt).

Now, as a complete novice to the field, I thought it would be pretty neat if I could build a web scraper that would run every time I started up my laptop, grab the number of learners on edX on any given day, and compile it all into an Excel file.

(Screenshot of the edX website.)

And being a complete novice to the field, I decided to only scrape the current number of learners as well as the date, and then called it a day.
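For the curious, the core idea boils down to something like the sketch below. This is a reconstruction rather than the original script: the URL is real, but the CSS selector and the output filename are hypothetical stand-ins, and it appends to a CSV instead of an Excel file to keep things short.

```python
import csv
from datetime import date

import requests
from bs4 import BeautifulSoup

URL = "https://www.edx.org/"
OUTPUT_FILE = "edx_learners.csv"  # hypothetical output path


def scrape_learner_count():
    """Fetch the edX homepage and append today's learner count to a CSV."""
    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Hypothetical selector for the learner counter -- the real element is
    # exactly what changed (and moved behind Javascript) when edX redesigned
    # their site, which is what eventually broke the scraper.
    counter = soup.select_one(".learner-count")
    learners = int(counter.get_text(strip=True).replace(",", ""))

    with open(OUTPUT_FILE, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), learners])


if __name__ == "__main__":
    scrape_learner_count()
```

Hook a script like this into your OS's startup tasks (Task Scheduler on Windows, cron or launchd elsewhere) and it quietly grows by one row per boot.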

So that is how I left my poorly designed scraper to its own devices for about a year, only to realize that it had been broken for a couple of months by the time I finally went back to check on it, due to a change in edX's website (turns out BeautifulSoup4 can't handle Javascript-rendered content, the more you know I guess).

But coming back to the main topic, I decided that I might as well make use of the dataset my scraper had painstakingly compiled and attempt to find out just how much information we can obtain from just two columns. Let's start with some cleaning...
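Roughly speaking, the cleaning amounts to the sketch below: parse the dates, coerce the learner counts to numbers, and throw out anything that doesn't survive. The filename is a placeholder, and the two columns are assumed to be named 'Date' and 'Num of Learners'.

```python
import pandas as pd

# Load the scraped dataset (placeholder filename).
df = pd.read_excel("edx_learners.xlsx")

# Parse dates and coerce learner counts to numeric; anything unparseable becomes NaN.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
df["Num of Learners"] = pd.to_numeric(df["Num of Learners"], errors="coerce")

# Drop rows that failed to parse, drop duplicate scrapes from the same day,
# and sort everything chronologically.
df = (
    df.dropna()
      .drop_duplicates(subset="Date")
      .sort_values("Date")
      .reset_index(drop=True)
)

print(df.shape)  # number of surviving scrapes x 2 columns
```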

Well, that should do it for the data cleaning part of this analysis; now onto the actually exciting stuff.

Coincidentally, it looks like my scraper broke down right before its 300th scrape. This isn't particularly important to our analysis, but I thought it was neat (and also a bit depressing).

Exploratory Data Analysis (EDA)

Let's start off with a simple describe() call. At this point we can't really say much about the data yet, but these values should at least be a little interesting to look at.
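In code it is just the one-liner below, reusing the df from the cleaning step:

```python
# Count, mean, standard deviation, min/max and quartiles of the daily learner counts.
print(df["Num of Learners"].describe())
```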

The two following plots have been made to assist us in understanding the distribution of our dataset. Technically, we could already have figured most of this out from the previous describe() call, but they help us visually confirm the details of the distribution: it appears to be centered around 470,000 learners per day and forms a Gaussian-like shape.
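A histogram plus a box plot along these lines is one way to get that view of the distribution (the original pair of plots may have differed slightly):

```python
import matplotlib.pyplot as plt

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 4))

# Histogram of the daily learner counts.
ax_hist.hist(df["Num of Learners"], bins=20)
ax_hist.set_xlabel("Num of Learners")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Distribution of daily learner counts")

# Box plot of the same values, showing the spread and any outliers.
ax_box.boxplot(df["Num of Learners"], vert=False)
ax_box.set_xlabel("Num of Learners")
ax_box.set_title("Box plot of daily learner counts")

plt.tight_layout()
plt.show()
```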

Next up, let's construct a line chart of the number of learners over time. While line charts of fluctuating data may not be the easiest to interpret (like, say, the stock market), they at least show us that the values in the dataset do vary quite a bit from day to day.
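A minimal version of that chart:

```python
import matplotlib.pyplot as plt

# Daily learner counts over the lifetime of the scraper.
plt.figure(figsize=(12, 4))
plt.plot(df["Date"], df["Num of Learners"])
plt.xlabel("Date")
plt.ylabel("Num of Learners")
plt.title("edX learners over time")
plt.tight_layout()
plt.show()
```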

The following code just adds a new column to our DataFrame containing the month extracted from each recorded date, which we then use to plot a bar chart of the average learners per month. Other than February being a little lower than average, and July being non-existent (rip web scraper), the rest of the data is pretty uniform; nothing seems out of the ordinary here.
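A sketch of what that code might look like, with the months kept in calendar order rather than alphabetical:

```python
import matplotlib.pyplot as plt

# New column holding the month name extracted from each recorded date.
df["Month"] = df["Date"].dt.month_name()

monthly_avg = df.groupby("Month")["Num of Learners"].mean()

# Keep calendar order, skipping months that never got scraped (rip July).
calendar = ["January", "February", "March", "April", "May", "June",
            "July", "August", "September", "October", "November", "December"]
month_order = [m for m in calendar if m in monthly_avg.index]

monthly_avg.reindex(month_order).plot(
    kind="bar", figsize=(10, 4),
    ylabel="Average Num of Learners",
    title="Average learners per month",
)
plt.tight_layout()
plt.show()
```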

In a similar vein, we can plot out the average learner count for each day of the week, as we do below. Again, everything seems pretty standard here.
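Along the same lines as the monthly chart:

```python
import matplotlib.pyplot as plt

# Average learner count for each day of the week, in calendar order.
df["Weekday"] = df["Date"].dt.day_name()
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

weekday_avg = (
    df.groupby("Weekday")["Num of Learners"]
      .mean()
      .reindex(weekday_order)
)

weekday_avg.plot(kind="bar", figsize=(8, 4),
                 ylabel="Average Num of Learners",
                 title="Average learners by day of the week")
plt.tight_layout()
plt.show()
```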

Could it be faked?

Our analysis so far has been pretty boring, revealing nothing really interesting about the dataset (kids, this is why you have more than two columns). So... to spice things up a little, why don't we run some tests to check if edX is lying or in some way manipulating the number of visitors to their site?

Companies being sneaky about their statistics is pretty common, after all, so let's see what insights data analytics can lend us on the validity of edX's numbers.

Note: I do not think that edX is being deceptive with their numbers, nor is this analysis meant to cast any suspicion on them; this is just meant to be an exercise in statistical reasoning and testing. Don't sue me, thanks.

Benford's Law

To start things off, we have Benford's Law (https://www.statisticshowto.com/benfords-law/), which roughly states that a set of numbers generated through a sufficiently random, natural process should have leading digits that follow a very specific logarithmic distribution, with 1 showing up far more often than 9.

There are other caveats and assumptions behind this law, but for brevity's sake we will leave them out for now; if this interests you, do check out the link provided above.

(Figure: the distribution of leading digits expected under Benford's Law.)
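A quick way to run this check is to pull out the leading digit of every learner count and set the observed frequencies against Benford's expected proportion of log10(1 + 1/d) for each digit d. A sketch, reusing the df and column names assumed earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Leading digit of each learner count.
leading_digits = (
    df["Num of Learners"].astype(int).astype(str).str[0].astype(int)
)
observed = leading_digits.value_counts(normalize=True).sort_index()

# Benford's expected proportion for each leading digit d: log10(1 + 1/d).
digit_values = np.arange(1, 10)
expected = np.log10(1 + 1 / digit_values)

# Side-by-side bars: what we observe vs what Benford predicts.
width = 0.4
plt.bar(digit_values - width / 2, observed.reindex(digit_values, fill_value=0),
        width, label="Observed")
plt.bar(digit_values + width / 2, expected, width, label="Benford")
plt.xlabel("Leading digit")
plt.ylabel("Proportion")
plt.legend()
plt.show()
```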

Now, our dataset is not randomly generated; after all, it is supposed to represent the number of visitors to edX on any particular day, which should be subject to quite a few non-random factors such as edX's marketing efforts, consistent users like students and instructors, visits from partnered institutions, etc.

Therefore, the fact that this dataset does not conform to Benford's Law is, if anything, a point in favor of its veracity. Unfortunately, Benford's Law works best on datasets that span several orders of magnitude, which this one definitely doesn't, so while Benford has given us a nice first gauge of the dataset's veracity, there is still much work to be done.

Randomness tests

Next up, we will try some assorted tests designed to look for signs of a human hand messing with the dataset. We will start by splitting the dataset into three parts and checking whether each part's mean remains roughly similar to the mean of the whole dataset.
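Something along these lines, using numpy to cut the series into three chronological chunks:

```python
import numpy as np

learners = df["Num of Learners"].to_numpy()

print(f"Whole dataset: mean={learners.mean():.0f}, std={learners.std():.0f}")

# Split into three roughly equal, chronologically ordered chunks and compare.
for i, chunk in enumerate(np.array_split(learners, 3), start=1):
    print(f"Chunk {i}:       mean={chunk.mean():.0f}, std={chunk.std():.0f}")
```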

Admittedly, the test above isn't the most convincing or statistically sound, but the reasoning is that it would be quite hard for a human to make it so that even arbitrarily split subsets of the data possess the same mean and standard deviation.

This still doesn't prove that the dataset isn't generated by sampling from a pre-defined probability distribution that has a mean and standard deviation similar to the ones we have obtained, but honestly I have no clue how I would even begin to prove or disprove such a claim. A quick Google search seems to show that most people don't either.

The following code cells split the 'Num of Learners' column into its constituent digits and digit pairs respectively. The idea is that when humans attempt to make up random numbers, they tend to leave their own preferences for specific digits or digit pairs in those numbers.

Also note that I have removed the leading digit and leading digit pair of each number from both of the following charts; as we have seen previously, the dataset is centered around the 400,000s, so including the leading digits would skew the distribution heavily towards 4.
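A sketch of how those digit and digit-pair frequencies can be extracted and plotted, dropping the leading digit and leading pair as just described:

```python
import matplotlib.pyplot as plt

# Each learner count as a string of digits, e.g. 471234 -> "471234".
num_strings = df["Num of Learners"].astype(int).astype(str)

# Individual digits, excluding the leading digit of each number.
digits = num_strings.str[1:].apply(list).explode().astype(int)

# Adjacent digit pairs, excluding the pair that starts at the leading digit.
digit_pairs = num_strings.apply(
    lambda s: [s[i:i + 2] for i in range(1, len(s) - 1)]
).explode()

fig, (ax_digits, ax_pairs) = plt.subplots(2, 1, figsize=(12, 8))

digits.value_counts().sort_index().plot(
    kind="bar", ax=ax_digits, title="Digit frequencies")
digit_pairs.value_counts().sort_index().plot(
    kind="bar", ax=ax_pairs, title="Digit pair frequencies")

plt.tight_layout()
plt.show()
```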

But as you can see from the visualizations, the distributions of both the digits and the digit pairs look fairly uniform and fairly random, with no obvious favorites.

Thus, once again our tests seem to imply that the dataset is unlikely to have been tampered with in any way by human hands, neat.

Conclusions

By and large, the preceding sections serve to provide at least some justification that edX isn't faking their numbers, and considering how little practical benefit edX would gain from falsifying such a minor detail on their website, it seems we can be reasonably confident that the numbers we have gathered are legitimate.

Beyond that, I hope you have enjoyed watching me contort myself and my poorly constructed dataset in order to present to you as much information about it as I possibly can. I hope this mess of a dataset that spawned from my naive mind a year ago was put to some entertaining use, and that my poor web scraper may rest in peace.

Have a good one, cheers!