Can Core ML in iOS Really Do Hot Dog Detection Without Server-side Processing?


Machine learning has quickly become an important bedrock for a variety of applications. Its mobile implementation, however, has been out of reach for many in the mobile app development community. The training and implementation processes for machine-learning libraries require dedicated processing power that is beyond what mobile devices can offer. That processing-power requirement, combined with how existing frameworks are built, usually means a server-side component is necessary for even the smallest machine learning-backed apps. Finally, training a machine-learning model requires a good deal of knowledge that lies outside most developers' expertise.

Apple potentially solved those problems when it announced the release of Core ML at WWDC 2017. As noted in our review of iOS 11 updates, we were excited about the announcement of Core ML and have since dug deeper into the documentation to examine how Core ML makes machine learning accessible to any iOS developer, no training required. We decided to put Core ML to the test to see if it would accurately answer an odd but important question: is this object a hot dog? Follow along as we explain what Core ML does right and demonstrate how to leverage machine learning in your own apps by using Core ML to detect whether something is, in fact, a hot dog.

How Core ML Solves Machine-Learning Problems for iOS Developers

Does Core ML herald the age of developer-accessible machine learning on mobile devices? No, but that was a trick question. Bite-sized machine learning (BSML anyone?) has actually been available to iOS developers since iOS 10, but it was more of a footnote. It was a good footnote though, providing two flavors of machine learning that gave developers a choice of utilizing either the CPU or the GPU based on their needs.

Core ML is a vast improvement over the two APIs offered in iOS 10. First off, it does away with having two APIs, which, in our book, is an 80% improvement right there. The framework decides which processor to use: the CPU, the GPU, or in the future, Apple's rumored AI chip.

Apple's chart of supported models and libraries.

Another big improvement to Apple's machine-learning offering comes with Core ML's ability to work with existing libraries. While mobile devices are powerful, using them to train a machine-learning model isn't quite practical. If you get annoyed at the slight wait when you see “Fetching new podcasts ...,” imagine how annoying seeing a spinner and “Training new neural network ...” would be.

So, if there's no way to train a model on a device, is Core ML basically a car without gas? Nope! Apple expanded the umbrella of Core ML to go beyond just offering a framework. Rather than trying to reinvent the wheel by creating its own full-fledged machine-learning library, Apple opted to give developers the tools to utilize some of the existing libraries.

By using a Python package called Core ML Tools, you can convert a model trained in any supported library into Core ML's “.mlmodel” format. In addition, Apple went ahead and converted some pretty popular scene- and image-detection models that are ready for you to drop into your app and use. This means a developer can conceivably train their own model, convert it to a Core ML model, and use it in their app. Even better, many open-source models are already available, so rather than training one from scratch, a developer can find an existing model and convert it.

How to Use Core ML in Your iOS Apps

In order to demonstrate just how to leverage these new Core ML powers, we whipped up a quick demo app that detects whether or not a provided image is, indeed, of a hot dog. Why a hot dog? Well, ever since HBO's Silicon Valley introduced the “Not Hotdog” app in their show, classifying images as hot dogs or not has quickly surpassed MNIST as the “Hello World” for machine learning, um, learning. So let's jump in.

If You Want to Create and Train Your Own Core ML Model

If you choose to go down this path, check out this Above Intelligent post to get an overview of how to do so in TensorFlow, Google's machine learning library. Keep in mind that TensorFlow's models are not, at this time, supported by Core ML Tools. Then head over to this Keras Blog post to find out how to implement said model in Keras. Keras is a neural network API that gives developers the choice of which machine learning framework powers the training and implementation of their model: Google's TensorFlow, Microsoft's CNTK, or Theano. Stick with the first example in that post, as the other method is an implementation of a pre-trained model that Apple already converted to .mlmodel. Note that the code in the Keras blog post is aimed at Keras 2, but Core ML Tools currently only supports Keras 1.2.2. While Keras 2 support is in the works, you'll have to go here for the pre-Keras 2 code for now.

Choosing and Adding an Existing Core ML Model

As for the rest of you, all four of Apple's currently offered models deal with image classification, so the hard part is choosing which one to use. One of them, Places205-GoogLeNet, is disqualified, as it detects the overall scene of the photo instead of individual objects, and it has yet to be trained to detect a field of hot dogs. So we're down to three options: ResNet50, Inception v3, or VGG16. All three are similar in their methodology. Each model knows 1,000 image categories; given an image, it uses what it learned from its training data to estimate the probability that the image belongs to each of those categories. The image is then matched to the category with the highest probability.
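To make that "highest probability wins" step concrete, here's a toy Swift sketch with made-up numbers (the labels and probabilities below are illustrative, not real model output):

// Toy illustration: three made-up scores. The real models score 1,000 categories the same way.
let probabilities: [String: Double] = [
    "hotdog, hot dog, red hot": 0.91,
    "cheeseburger": 0.06,
    "bagel, beigel": 0.03
]

// The image is matched to whichever category scores highest.
let bestGuess = probabilities.max { $0.value < $1.value }?.key
// bestGuess == Optional("hotdog, hot dog, red hot")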

In addition to the similar methodology, all three of the models were designed with the true purpose of machine learning in mind: the identification of hot dogs. All three of Apple's .mlmodel versions of the models use the same 1,000 labels, which, of course, include hot dogs.

Given that we have three models that do what we're looking for, which one should we go with? Well, seeing as we eventually want this to end up on a mobile device, let's go with the smallest of the three: Inception v3.

Having chosen our model and downloaded it, let's create a new iOS app in Xcode 9. Drag and drop the Inceptionv3.mlmodel file into your project. Now let's create a new Swift class. We'll call it “Predictor” and it will look like this:

import Foundation
import CoreML
import UIKit

class Predictor {

    // Apple's pre-converted Inception v3 model, generated by Xcode from Inceptionv3.mlmodel.
    static let model = Inceptionv3()

    static func isThisAHotDog(image: UIImage) -> Bool {
        // Inception v3 expects a 299x299 image, so resize it and convert it to a pixel buffer first.
        guard let resizedImage = image.scaleImage(newSize: CGSize(width: 299.0, height: 299.0)),
            let pixelBuffer = resizedImage.buffer(),
            let prediction = try? model.prediction(image: pixelBuffer) else {
                return false
        }
        // "hotdog, hot dog, red hot" is the label Inception v3 uses for hot dogs.
        return prediction.classLabel == "hotdog, hot dog, red hot"
    }

}

Awesome, we're ready to go! Oh wait, you're wondering about those scaleImage(newSize:) and buffer() functions we're calling on UIImage. Let's create that UIImage extension first and then explain why we need it. Create a new Swift file and add the following code:

import Foundation
import UIKit

extension UIImage {

    /// Redraws the image into a context of the given size (stretching it if necessary).
    func scaleImage(newSize: CGSize) -> UIImage? {
        UIGraphicsBeginImageContextWithOptions(newSize, false, 0.0)
        self.draw(in: CGRect(x: 0, y: 0, width: newSize.width, height: newSize.height))
        let newImage = UIGraphicsGetImageFromCurrentImageContext()
        UIGraphicsEndImageContext()
        return newImage
    }

    /// Renders the image into a 32ARGB CVPixelBuffer, which is what the model's prediction function expects.
    func buffer() -> CVPixelBuffer? {
        // Create an empty pixel buffer matching the image's dimensions.
        let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue,
                     kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue] as CFDictionary
        var pixelBuffer: CVPixelBuffer?
        let status = CVPixelBufferCreate(kCFAllocatorDefault, Int(self.size.width), Int(self.size.height), kCVPixelFormatType_32ARGB, attrs, &pixelBuffer)
        guard status == kCVReturnSuccess, let buffer = pixelBuffer else {
            return nil
        }

        // Lock the buffer and wrap its memory in a Core Graphics context.
        CVPixelBufferLockBaseAddress(buffer, CVPixelBufferLockFlags(rawValue: 0))
        let pixelData = CVPixelBufferGetBaseAddress(buffer)

        let rgbColorSpace = CGColorSpaceCreateDeviceRGB()
        guard let context = CGContext(data: pixelData, width: Int(self.size.width), height: Int(self.size.height), bitsPerComponent: 8, bytesPerRow: CVPixelBufferGetBytesPerRow(buffer), space: rgbColorSpace, bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue) else {
            CVPixelBufferUnlockBaseAddress(buffer, CVPixelBufferLockFlags(rawValue: 0))
            return nil
        }

        // Flip the coordinate system so UIKit's top-left origin draws right side up.
        context.translateBy(x: 0, y: self.size.height)
        context.scaleBy(x: 1.0, y: -1.0)

        // Draw the image into the buffer's memory, then unlock it.
        UIGraphicsPushContext(context)
        self.draw(in: CGRect(x: 0, y: 0, width: self.size.width, height: self.size.height))
        UIGraphicsPopContext()
        CVPixelBufferUnlockBaseAddress(buffer, CVPixelBufferLockFlags(rawValue: 0))

        return buffer
    }

}

Understanding the Core ML Model

The pixel buffer is the easier piece, so it makes sense to explain that first, even though it occurs after the resizing step. The prediction function generated for the model doesn't take a UIImage; it takes a CVPixelBuffer, a Core Video type that exposes the image's raw pixel data, which is what the model actually processes. That's why buffer() draws our UIImage into a pixel buffer before we hand it to the model.

Keep in mind that each of the models was trained on a specific image size, and Inception v3 expects an image that's 299x299. But what if our image isn't 299x299 and is instead 1000x1000, or not even square, like a photo from, oh, say, an iPhone? Wouldn't squishing what was once a delicious-looking hot dog make our model fail to recognize it? The short answer is “no.” The somewhat longer (but still short, given the subject) answer is: not really.

Basically, the model processes the image by converting it into a three-dimensional array. Think of it as three spreadsheets, one each for red, green, and blue. Each spreadsheet is 299 columns by 299 rows, so each cell corresponds to a pixel location in the image, and the cell's value is the amount of that color in that pixel. Even though the image may look weird after resizing, the model isn't actually looking at the picture of the hot dog the way we do; it's examining the pixel values. When the model was trained (on images that were likewise squished and stretched from their original sizes into 299x299), it determined, at the pixel level, which features and thresholds increased or decreased the probabilities of an image's categorical classification. It applies what it learned during training, again at the pixel level, to process new images.
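If you're curious what those raw values look like, here's a rough sketch of a helper (our own, not part of Core ML) that reads one pixel's red, green, and blue components from a 32ARGB CVPixelBuffer like the one our buffer() extension produces:

import CoreVideo

// Illustrative helper: read one pixel's color components from a 32ARGB pixel buffer.
func samplePixel(in buffer: CVPixelBuffer, x: Int, y: Int) -> (red: UInt8, green: UInt8, blue: UInt8)? {
    CVPixelBufferLockBaseAddress(buffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(buffer, .readOnly) }

    guard let base = CVPixelBufferGetBaseAddress(buffer) else { return nil }
    let bytesPerRow = CVPixelBufferGetBytesPerRow(buffer)

    // Each 32ARGB pixel is 4 bytes: alpha, red, green, blue.
    let pixel = base.advanced(by: y * bytesPerRow + x * 4).assumingMemoryBound(to: UInt8.self)
    return (red: pixel[1], green: pixel[2], blue: pixel[3])
}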

Using and Modifying Your Predictor Class

Now you have your Predictor class that you can use in the sample app you created. You can also drop it into any other app (along with the Inception v3 model) and feed it images from the camera, photo library, or any other source available on a user's device. Test it by copying a photo into your project and adding:

if let image = UIImage(named: "your_hot_dog_photo.jpg") {
    print(Predictor.isThisAHotDog(image: image))
}

to your AppDelegate in didFinishLaunchingWithOptions.

So, conceivably, you could modify this to classify a photo as one of the other 999 categories that Inception v3 has been trained for. But why would you want to? Just to follow that outrageous thought through to completion, you would modify the return type from Bool to an optional String and return prediction.classLabel.
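Here's a rough sketch of that variant, reusing the same resize-and-buffer steps as isThisAHotDog (the classify name is our own):

    // Hypothetical variant for Predictor: return the winning label (or nil on failure)
    // instead of a hot dog Bool, so any of the 1,000 categories can be reported.
    static func classify(image: UIImage) -> String? {
        guard let resizedImage = image.scaleImage(newSize: CGSize(width: 299.0, height: 299.0)),
            let pixelBuffer = resizedImage.buffer(),
            let prediction = try? model.prediction(image: pixelBuffer) else {
                return nil
        }
        return prediction.classLabel
    }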

Making a More Effective Model

We've thrown quite a few hot dog images at our Predictor, and so far, it's been pretty solid at identifying the average hot dog. Add a few unfamiliar condiments like jalapeños or tomatoes to your hot dog, though, and the model gets confused. If we need our model to handle the many varieties of hot dogs out there, we could use Keras or Caffe to train our own classifier and use Core ML Tools to convert it to .mlmodel. What this all boils down to is that Core ML lets you easily integrate machine-learning models into your iOS app, but your app's effectiveness depends on the strength of the model you use. The better trained and tuned the model, the better your results with Core ML will be.
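As an aside, one quick way to gauge just how confused (or confident) the stock model is on a given photo is to look at the full probability distribution rather than only the winning label. Here's a hedged sketch for Predictor; it assumes the generated Inceptionv3 class exposes a classLabelProbs dictionary ([String: Double]) alongside classLabel, so check the generated interface in Xcode if yours differs:

    // Sketch: return the model's top guesses with their probabilities, assuming
    // the generated Inceptionv3 class exposes classLabelProbs ([String: Double]).
    static func topPredictions(for image: UIImage, count: Int = 5) -> [(label: String, probability: Double)] {
        guard let resizedImage = image.scaleImage(newSize: CGSize(width: 299.0, height: 299.0)),
            let pixelBuffer = resizedImage.buffer(),
            let prediction = try? model.prediction(image: pixelBuffer) else {
                return []
        }
        return prediction.classLabelProbs
            .sorted { $0.value > $1.value }
            .prefix(count)
            .map { (label: $0.key, probability: $0.value) }
    }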

Concluding Note

As you can see, Apple has set a high bar for mobile, device-based machine learning with Core ML. We expect something equally cool from Google and Facebook. Google hasn't sat idly by, as evidenced by TensorFlow and the announcement of its entry into BSML with TensorFlow Lite. And Facebook's Caffe2Go is supposed to be aimed at both iOS and Android. Unfortunately, the last time we heard about Caffe2Go was a Facebook announcement in November 2016. Fingers-crossed emojis, anyone? It will be interesting to see what machine-learning progress Apple, Google, and Facebook make in the next year or two.

Let us know how you take advantage of Core ML in your own apps by contacting us on Twitter or via our website. You can find out more about what developers should know about WWDC 2017 in our resource 11 Considerations to Update Your App for iOS 11.

Jaz is an app developer who enjoys creating tasteful interactions, fleets of world-dominating bots, and....puns.
