Course Final Project

Instructions

The purpose of this project is to allow you to demonstrate each of the techniques you've learned in this course in a relevant context. You should begin by identifying an unstructured business problem you'd like to help solve and selecting a dataset that you can analyze to provide decision support for that problem. This could be something from your current job or a dataset you find online. Google Dataset Search and Kaggle.com are great resources for datasets and ideas for your final project.

Once you've decided on a business problem and accompanying dataset, proceed with each of the requirements below:

High-Level Requirements

  1. Project Documentation: the requirements for this may vary depending on whether you are completing a project for an actual company or not. However, the default for documentation is minimal and will be contained in a single Excel workbook (only one turned in for your entire team). Include the following worksheets (and more if you think of other things that would be useful):

    • Project Summary: the first worksheet should contain the names and email addresses of all project team members, the project title, and an abstract describing the motivation behind the project and the business problem that it helps to solve. For example:

      Abstract: One of the primary problems faced by Intermountain Healthcare is nurse scheduling. Although Intermountain can predict the number of patients who will arrive at their emergency room from day to day, they struggle to predict the number of nurse hours that will be required for each patient. As a result, they often over- or under-schedule nurses. The purpose of this project is to build a calculator app that will predict the number of nurse hours required for each patient. The output of this app will serve as an input into their existing scheduling system.

    • Web Service (optional): This is a worksheet that calls each of the predictive APIs that you create for this project. Your instructor will let you know if she or he wants you to complete this portion and what it should include.

    • Data Visualizations: These may be created in Tableau or Excel. Include the most appropriate visualizations depicting the relationships between the primary independent variables and the dependent variable. The purpose of this sheet is to make it clear what the exact effect of each independent variable looks like. For example, a regression coefficient of 0.35 indicates that the independent variable has a positive effect, but a visualization of that variable's relationship to the dependent variable helps us to understand whether that effect is consistent across all values of x.

    • Model Testing: This is where you will list every model you attempt (and the relevant fit statistics: R squared, RMSE, Accuracy, AUC, NDCG, etc.) and the variations in model hyperparameters. Clearly identify the final model(s) you chose for your API.

    • Findings: This should be a short, bulleted write-up of what you found in your analyses. For example, which variables had the largest impact? What do you infer from that? Were there any unexpected findings? How would you explain them? This section should be 3-5 short paragraphs (one for each key point).

    • Data Dictionary: List your variables here. Explain where they came from and what work you had to do to collect them and clean them. Mark the variables that you had to engineer or reengineer for a better model. Basically, show all of your hard work that made your prediction possible.

  2. Tableau Story posted to Tableau Public (see instructions below): the URL for this story should be included in the documentation and submitted below

    • No minimum number of story points is required, but the story should be sufficient to reveal several useful insights

    • Include at least one dashboard in the story that users can interact with by using filters

  3. Azure ML Studio experiment published to the ML Studio Gallery: the URL for this gallery experiment should be included in the documentation and submitted below

    • At least two experiments should be included from the following four types: 1) regression, 2) classification, 3) text analysis, 4) matchbox recommender. However, depending on your unique context, you may create two types of matchbox recommenders or two regressions. Basically, you should show a breadth of skills, but it's more important that whatever you create is useful and not artificial for the purposes of the class. So talk to your instructor if you need some wiggle room on this requirement.

    • Evidence of sufficient model comparison and evaluation should be documented. This includes both 1) variable selection, and if appropriate, 2) algorithm selection

  4. Technical requirement: Depending on the technical expertise of the students (see your Instructor for your exact requirements), you will need to either implement a web app or native mobile app that uses the Azure ML Experiment (see below for more details on this requirement) or add another worksheet to your Excel documentation file that calls the Azure ML Studio API and uses the prediction in some type of What-If Analysis or Excel Dashboard.

Detailed Requirements

  1. Data Preparation

    • Resolve missing data: generally speaking, columns and rows with more than 50% of the data missing will likely need to be eliminated altogether. For those with less than 50% missing, is the data numeric or categorical? For numeric data, consider using Azure ML Studio's Clean Missing Data feature. For categorical fields that are missing, consider one of two approaches: 1) if the data is missing because it was forgotten, also use the Clean Missing Data feature, or 2) if the data is missing because there is no relevant value to use (e.g. consider the case where a product's "weight" is left blank; maybe that's because it's a downloadable software product and weight is irrelevant), consider entering a value like "blank", which will allow that record to be analyzed.

    • Convert data to a usable form: sometimes your data comes in formats that are unusable. For example, if you scrape product data from Amazon.com, you may get a price in the following format: <span class="currency">$18.99</span>. In this case, you'll need to remove those unnecessary HTML tags before you can use the data.

    • Generate or collect new variables: Are there new variables that could be generated from existing variables that would be useful? For example, if you're using Amazon.com product data, the "category" field may come back in the form: "Home, Electronics, Tablets, Apple, iPad, iPad Pro". In that case, you may want to turn that variable into six variables to record the category, sub-category, ..., etc. Similarly, if you collect data about crimes across a variety of locations, you may want to look up additional information about each location like how close it is to a business or residence, weather patterns, or other data relevant to a location.
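
The three preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a required part of the project: the function names are hypothetical, missing values are assumed to be represented as None, mean substitution stands in for whichever option you pick in a cleaning tool, and padding short category paths with "blank" is an assumption borrowed from the missing-data advice above.

```python
import re
from statistics import mean


def missing_share(values):
    """Fraction of entries that are missing (None) in a column or row."""
    return sum(v is None for v in values) / len(values)


def fill_missing_numeric(values):
    """Replace None with the mean of the observed values (one possible substitution)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def parse_price(raw):
    """Strip HTML tags from a scraped price and return it as a float."""
    text = re.sub(r"<[^>]*>", "", raw).strip()
    return float(text.replace("$", "").replace(",", ""))


def split_category(path, levels=6):
    """Split a comma-separated category path into a fixed number of variables."""
    parts = [p.strip() for p in path.split(",")][:levels]
    # Pad shorter paths so every record has the same number of columns.
    return parts + ["blank"] * (levels - len(parts))
```

For example, parse_price('<span class="currency">$18.99</span>') returns 18.99, and split_category("Home, Electronics, Tablets, Apple, iPad, iPad Pro") returns six separate values, one per category level. A column whose missing_share is above 0.5 would be a candidate for removal under the guideline above.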

  2. Data Understanding (completed recursively with the Data Preparation stage)

    • Tableau: create a Tableau Story that includes at least one dashboard. This story should walk the user through all of the relevant charts (including forecasts where relevant and cluster analysis) that will describe what insights you can find from descriptive data analyses. There is no minimum number of story points required. However, your story--and your documentation--should adequately indicate why you created each visualization and what we can learn from it. You should post this story to Tableau Public and then copy the URL link to your Tableau Public story into your documentation. For instructions on how to do this, refer to this video:

    • Statistics: some of these can be calculated in Excel, some in Azure ML Studio, and some in both. This should include a correlation matrix for all numeric variables. You should identify strong correlations and provide a few bullet points on correlations of interest (e.g. Which independent variables are most correlated with the dependent variable? Which independent variables are possibly too highly correlated?). Next, you should calculate skewness and kurtosis for every numeric variable. For those which violate the limits, include a histogram (you can generate these in the data view of Azure ML Studio). Try a variety of transformations to determine whether any of them will improve the skewness and kurtosis issues. If these problems can be fixed, include the newly transformed versions of the variables in your model training in the next phase. Also, if you are using a regression model, check for non-linear relationships. You can do this in Tableau or Excel. Create new variables to test if your examination reveals the possibility of exponential, logarithmic, or polynomial terms.
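
If it helps to see what these statistics compute, the following is a minimal Python sketch of the Pearson correlation, skewness, and excess kurtosis, assuming the population (moment-based) definitions. Spreadsheet functions such as Excel's SKEW and KURT apply a sample-size adjustment, so their values will differ slightly on small datasets. The function names are illustrative only.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def _central_moment(xs, k):
    """k-th central moment of the data (population form, divides by n)."""
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs):
    """Third standardized moment; 0 for perfectly symmetric data."""
    return _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return _central_moment(xs, 4) / _central_moment(xs, 2) ** 2 - 3


# Illustrative data: a right-skewed variable and its log transform.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(x) for x in raw]
```

Here skewness(raw) is clearly positive, while skewness(logged) is essentially zero because the logged values are evenly spaced. That is the kind of before-and-after comparison to look for when trying transformations.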

  3. Model Testing

    • Your project should include two of the four types of analyses taught in this class: 1) Regression, 2) Classification, 3) Text Analysis, 4) Matchbox Recommender

    • You should demonstrate a comprehensive set of testing across 1) various sets of independent variables, and if appropriate, 2) various algorithms.

    • You should demonstrate your reasoning for which independent variables you chose (e.g. Permutation Feature Importance, Principal Components Analysis) and which algorithm you chose (R squared, Accuracy, NDCG, etc.)
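
To make the idea behind permutation-based variable selection concrete, here is a small Python sketch, assuming a model exposed as a plain predict function and RMSE as the error metric. It is not the Azure ML Studio module itself; the names permutation_importance, predict, rows, and targets are hypothetical. The idea is to shuffle one column at a time and record how much the error grows: a variable the model relies on produces a large increase, and an ignored variable produces none.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def permutation_importance(predict, rows, targets, seed=0):
    """Increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    baseline = rmse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by its shuffled value.
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        score = rmse(targets, [predict(r) for r in permuted])
        importances.append(score - baseline)
    return importances


# Illustrative model that depends only on the first feature.
rows = [[i, (i * 7) % 5] for i in range(10)]
targets = [2 * r[0] for r in rows]
scores = permutation_importance(lambda r: 2 * r[0], rows, targets)
```

In this toy case the first entry of scores is positive and the second is exactly zero, which matches the fact that the model never reads the second feature.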

  4. Model Evaluation

    • You should report the evaluation results of every relevant model that you build and test. This will justify why you chose your final set of variables and the particular algorithm used. This section does not need to be particularly long. Just summarize and sort the results.

    • Your evaluation will vary depending on the types of analyses you used. For example, Regression models should include at least R squared (coefficient of determination) and RMSE. Classification models should include at least accuracy and precision. Matchbox Recommender models should include at least RMSE and NDCG.
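
As a reference for what each of these statistics measures, the sketch below computes them in plain Python. It is a simplified illustration: the function names are hypothetical, precision is computed for a single positive class, and NDCG uses the linear-gain form with a log2 discount (other tools may use a different gain, so values are not guaranteed to match).

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: share of variance explained by the model."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def accuracy(actual, predicted):
    """Share of records whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision(actual, predicted, positive=1):
    """Share of predicted positives that are actually positive."""
    hits = [a == positive for a, p in zip(actual, predicted) if p == positive]
    return sum(hits) / len(hits) if hits else 0.0


def ndcg(relevances):
    """NDCG for true relevance scores listed in the order the model ranked them."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Summarize and sort candidate models (made-up numbers, for illustration only).
results = [{"model": "A", "rmse": 4.2}, {"model": "B", "rmse": 3.1}]
ranked = sorted(results, key=lambda r: r["rmse"])
```

Sorting the results by the chosen statistic, as in the last two lines, is usually all the summary this section needs.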

  5. Deployment

    • Your Azure ML Experiments should be deployed to the Azure ML Gallery

    • Your Tableau Stories and Dashboards should be published in a single workbook to Tableau Public. You can follow a video tutorial on how to perform this here:

Technical Requirement (you will only fulfill one of the two requirements below; see your Instructor for details)

  • Programming/Development Requirement (for those with programming backgrounds)

    1. Technical students will need to create a web app or native mobile app in which the Tableau dashboard and the ML Studio Web Service are used in context.

    2. All projects need prior instructor approval. The high-level requirements for the web app (or mobile app) should be determined in advance.

    3. Technical students do not necessarily need to create a Tableau Story. They can create a dashboard (or a set of dashboards) only. However, the dashboard(s) should be embedded into a web page that is part of the web application. In addition, the Tableau JavaScript API should be used to develop additional functionality that is incorporated with the dashboard. While not every technique that is taught in this course needs to be implemented, students should be creative in determining which features would be useful in their specific application.

    4. The Azure ML Studio Web Service(s) should be called from the web app according to high-level requirements agreed upon with the instructor.

    5. The data source for the predictive model should be either an Azure SQL Server database or an Azure Blob to facilitate easy retraining of the model

    6. The app should also have some mechanism for storing new results into the same Azure SQL Server database that the model is trained from.
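
As a rough guide for item 4, the sketch below builds (but does not send) a request-response scoring call in Python. It assumes the JSON layout typically shown in the sample code on a classic Azure ML Studio web service's API help page, with a default input named "input1" and a Bearer API key. The URL, key, and column names are placeholders, so confirm the exact format against your own service's help page.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, column_names, rows):
    """Build a POST request carrying the independent variables for scoring."""
    payload = {
        "Inputs": {
            "input1": {  # assumed default input name; yours may differ
                "ColumnNames": list(column_names),
                "Values": [[str(v) for v in row] for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers)


# Placeholder endpoint, key, and column names (hypothetical values).
request = build_scoring_request(
    "https://example.invalid/execute?api-version=2.0",
    "YOUR_API_KEY",
    ["acuity_level", "patient_age"],
    [[3, 72]],
)
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Keeping the request construction separate from the network call makes it easy to inspect or unit-test the payload before spending any calls against the live service.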

  • Excel Dashboard or What-If Analysis Requirement (for those with less-technical backgrounds)

    1. Create an additional worksheet in the same Excel workbook that you use to submit your project documentation

    2. Consider a use case where a prediction calculator would be beneficial as a decision-support tool.

    3. Create a professional-looking input form to send the independent variable data to the Azure ML Studio API you created.

    4. Format the returned prediction in some way that makes it useful. For example, you might consider creating a visualization that automatically displays the results of the prediction calculator.

    5. Be creative. There is a lot of room to complete this requirement "correctly" for full credit. Use your experience to help you devise a decision-support scenario where this prediction would be helpful.
