Four short links: 17 November 2017

O'Reilly Radar - Fri, 2017/11/17 - 02:00

Interactive Marginalia, In-Person Interactions, Welcoming Groups, and Systems Challenges

  1. Interactive Marginalia (Liza Daly) -- wonderfully thoughtful piece about web annotations.
  2. In-Person Interactions -- Casual human interaction gives you lots of serendipitous opportunities to figure out that the problem you thought you were solving is not the most important problem, and that you should be thinking about something else. Computers aren't so good at that. So true! (via Daniel Bachhuber)
  3. Pacman Rule -- When standing as a group of people, always leave room for 1 person to join your group. (via Simon Willison)
  4. Berkeley View of Systems Challenges for AI -- In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI’s potential to improve lives and society.

Continue reading Four short links: 17 November 2017.

Categories: Technology

Four short links: 16 November 2017

O'Reilly Radar - Thu, 2017/11/16 - 13:00

Regulate IoT, Visualize CRISPR, Distract Strategically, and Code Together

  1. It's Time to Regulate IoT To Improve Security -- Bruce Schneier puts it nicely: internet security is now becoming "everything" security.
  2. Real-Space and Real-Time Dynamics of CRISPR-Cas9 (Nature) -- great visuals, written up for laypeople in The Atlantic. (via Hacker News)
  3. How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, not Engaged Argument -- research paper. Application to American media left as an exercise for the reader.
  4. Coding Together in Real Time with Teletype for Atom -- what it says on the box.

Continue reading Four short links: 16 November 2017.

Categories: Technology

The tools that make TensorFlow productive

O'Reilly Radar - Thu, 2017/11/16 - 07:25

Analytical frameworks come with an entire ecosystem.

Deployment is a big chunk of using any technology, and tools to make deployment easier have always been an area of innovation in computing. For instance, the difficulties and uncertainties of installing software and keeping it up-to-date were one factor driving companies to offer software as a service over the Web. Likewise, big data projects present their own set of issues: how do you prepare and ingest the data? How do you view the choices made by algorithms that are complex and dynamic? Can you use hardware acceleration (such as GPUs) to speed analytics, which may need to operate on streaming, real-time data? Those are just a few deployment questions associated with deep learning.

In the report Considering TensorFlow for the Enterprise, authors Sean Murphy and Allen Leis cover the landscape of tools for working with TensorFlow, one of the most popular frameworks currently in big data analysis. They explain the importance of seeing deep learning as an integral part of a business environment—even while acknowledging that many of the techniques are still experimental—and review some useful auxiliary utilities. These exist for all of the major stages of data processing: preparation, model building, and inference (submitting requests to the model), as well as debugging.

Given that the decisions made by deep learning algorithms are notoriously opaque (it's hard to determine exactly what combinations of features led to a particular classification), one intriguing part of the report addresses the possibility of using TensorBoard to visualize what's going on in the middle of a neural network. The UI offers you a visualization of the stages in the neural network, and you can see what each stage sends to the next. Thus, some of the mystery in deep learning gets stripped away, and you can explain to your clients some of the reasons that a particular result was reached.
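Wiring TensorBoard into training is standard enough to sketch. The toy model, random data, and log directory below are invented for illustration and are not from the report; the point is only how little code it takes to get this visibility.

```python
# Minimal sketch of logging training to TensorBoard; the toy model,
# random data, and log directory are illustrative, not from the report.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# histogram_freq=1 records per-layer weights and activations every epoch,
# which is what lets you inspect what each stage sends to the next.
tensorboard = tf.keras.callbacks.TensorBoard(log_dir="./logs", histogram_freq=1)

x = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 2, size=256)
model.fit(x, y, epochs=3, callbacks=[tensorboard])
# Inspect the run with: tensorboard --logdir ./logs
```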

Another common bottleneck for many companies stems from the sizes of modern data sets, which often need help getting ingested and moved through the system. One study found that about 20% of businesses handle data sets in the range of terabytes, with smaller ranges (gigabytes) being most common and larger ones (petabytes) quite rare. For that 20% wrestling with unwieldy data sets, Murphy and Leis’s report is particularly valuable, because special tools can help tie TensorFlow to the systems that feed data into it, such as Apache Spark. The authors also cover options for hardware acceleration: a lot of research has been done on specialized hardware that can accelerate deep learning even more than GPUs do.

The essential reason for using artificial intelligence in business is to speed up predictions. To reap the most benefit from AI, therefore, one should find the most appropriate hardware and software combination to run the AI analytics. Furthermore, you want to reduce the time it takes to develop the analytics, which will allow you to react to changes in fast-moving businesses and reduce the burden on your data scientists. For many reasons, understanding the tools associated with TensorFlow makes its use more practical.

This post is part of a collaboration between O'Reilly and TensorFlow. See our statement of editorial independence.

Continue reading The tools that make TensorFlow productive.

Categories: Technology

Implementing The Pipes and Filters Pattern using Actors in Akka for Java

O'Reilly Radar - Thu, 2017/11/16 - 06:00

How messages help you decouple, test, and re-use your software’s code.

We would like to introduce a couple of interesting concepts from Akka by giving an overview of how to implement the pipes and filters enterprise integration pattern. This commonly used pattern helps us flexibly compose sequences of alterations to a message. To implement the pattern, we use Akka, a popular library that provides new approaches to writing modern reactive software in Java and Scala.
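Before the Akka specifics, the shape of the pattern itself is worth seeing. The sketch below is a rough Python analogue, not Akka: each filter is an independent worker with an inbox queue, which is essentially what an actor gives you. The stages and messages are invented for illustration.

```python
# Rough sketch of pipes and filters (not Akka): each filter is an
# independent worker with an inbox queue, like a simplified actor.
import queue
import threading

def make_stage(transform, outbox):
    """Apply `transform` to each inbox message and pass the result
    downstream; a None message shuts the pipeline down."""
    inbox = queue.Queue()
    def run():
        while True:
            msg = inbox.get()
            if msg is None:
                if outbox is not None:
                    outbox.put(None)   # propagate shutdown downstream
                return
            result = transform(msg)
            if outbox is not None:
                outbox.put(result)
            else:
                print(result)          # terminal stage: consume the message
    threading.Thread(target=run).start()
    return inbox

# Compose the pipeline from the sink backward, then feed it messages.
sink = make_stage(lambda m: m, None)
lowercase = make_stage(str.lower, sink)
pipeline = make_stage(str.strip, lowercase)
pipeline.put("  Hello, Pipes And Filters!  ")
pipeline.put(None)
```

Because each stage only knows its own inbox and outbox, filters can be reordered, tested, and reused independently, which is the decoupling the pattern is after.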

The business problem

Recently we came across an author publishing application made available as a service. It was responsible for processing markdown text. It would execute a series of operations back to back:

Continue reading Implementing The Pipes and Filters Pattern using Actors in Akka for Java.

Categories: Technology

Nathaniel Schutta on succeeding as a software architect

O'Reilly Radar - Thu, 2017/11/16 - 05:10

The O’Reilly Programming Podcast: The skills needed to make the move from developer to architect.

In this episode of the O’Reilly Programming Podcast, I talk with Nathaniel Schutta, a solutions architect at Pivotal, and presenter of the video I’m a Software Architect, Now What?. He will be giving a presentation titled Thinking Architecturally at the 2018 O’Reilly Software Architecture Conference, February 25-28, 2018, in New York City.

Continue reading Nathaniel Schutta on succeeding as a software architect.

Categories: Technology

Modern HTTP service virtualization with Hoverfly

O'Reilly Radar - Thu, 2017/11/16 - 04:00

Service virtualization brings a lightweight, automatable means of simulating external dependencies.

In modern software systems, it’s very common for applications to depend on third party or internal services. For example, an ecommerce site might depend on a third party payment service to process card payments, or a social network to provide authentication. These sorts of applications can be challenging to test in isolation, as their dependencies can introduce problems like:

  • Non-determinism
  • Slow and costly builds
  • Unmockable client libraries
  • Rate-limiting
  • Expensive licensing costs
  • Incompleteness
  • Slow provisioning

To get around this, service virtualization replaces these dependencies with a process that simulates them. Unlike mocking, which replaces your application code, service virtualization lives externally, typically operating at the network level. It is non-invasive, and from the perspective of its consumer it is essentially just like the real thing.
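Hoverfly itself runs externally as a proxy with recorded or scripted simulations; purely to make the concept concrete, here is a hand-rolled simulation of a third-party payment service that a test suite could point an application at. The endpoint, port, and payload are all invented for this sketch.

```python
# Toy illustration of service virtualization (Hoverfly itself runs as an
# external proxy): a canned HTTP simulation of a payment service.
# The endpoint, port, and payload are invented.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakePaymentService(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/v1/charges":
            body = json.dumps({"status": "succeeded", "id": "ch_test_1"})
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Tests point the application's payment URL at localhost:8500.
    HTTPServer(("localhost", 8500), FakePaymentService).serve_forever()
```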

Continue reading Modern HTTP service virtualization with Hoverfly.

Categories: Technology

Four short links: 15 November 2017

O'Reilly Radar - Wed, 2017/11/15 - 04:00

Paywalled Research, Reproducing AI Research, Spy Teardown, and Peer-to-Peer Misinformation

  1. 65 of the 100 Most-Cited Papers Are Paywalled -- The weighted average of all the paywalls is: $32.33 [...] [T]he open access articles in this list are, on average, cited more than the paywalled ones.
  2. AI Reproducibility -- Participants have been tasked with reproducing papers submitted to the 2018 International Conference on Learning Representations, one of AI’s biggest gatherings. The papers are anonymously published months in advance of the conference. The publishing system allows for comments to be made on those submitted papers, so students and others can add their findings below each paper. [...] Proprietary data and information used by large technology companies in their research, but withheld from papers, is holding the field back.
  3. Inside a Low-Budget Consumer Hardware Espionage Implant -- The S8 data line locator is a GSM listening and location device hidden inside the plug of a standard USB data/charging cable. Has a microphone but no GPS, remotely triggered via SMS messages, uses data to report cell tower location to a dodgy server...and is hidden in a USB cable.
  4. She Warned of ‘Peer-to-Peer Misinformation.’ Congress Listened (NY Times) -- Renee's work on anti-vaccine groups (and her college thesis on propaganda in the 2004 Russian elections) led naturally to her becoming an expert on Russian propaganda in the 2016 elections.

Continue reading Four short links: 15 November 2017.

Categories: Technology

Scaling messaging in Go network clients

O'Reilly Radar - Wed, 2017/11/15 - 04:00

Learn how the NATS client implements fast publishing and message-processing schemes viable for production use.

The previous article in this series created a client that communicated with a server in a simple fashion. This article shows how to add features that make the client more viable for production use. Problems we’ll solve include:

  1. Each message received from the server blocks the read loop while the callback that handles the message executes, because the loop and callback run in the same goroutine. This also means that we cannot implement the Request() and Flush() methods that the NATS Go client offers. (A sketch of the fix follows this list.)
  2. Every publish command triggers a flush to the server and blocks while doing so, hurting performance.
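A rough sketch of the fix for problem 1, shown in Python for brevity (the client described in this series does the equivalent with a goroutine and channel): the read loop only enqueues messages, and a separate dispatcher runs the callbacks, so slow user code can no longer stall reads. conn.read_message() is a hypothetical stand-in for parsing one protocol message.

```python
# Sketch of decoupling the read loop from user callbacks; the Go client
# uses a goroutine and channel for the same purpose.
import queue
import threading

pending = queue.Queue()

def read_loop(conn, on_message):
    # conn.read_message() is a hypothetical stand-in for parsing one
    # protocol message off the wire.
    while True:
        msg = conn.read_message()
        pending.put((on_message, msg))   # cheap; never blocks on user code

def dispatch_loop():
    while True:
        callback, msg = pending.get()
        callback(msg)                    # slow user code runs off the read path

threading.Thread(target=dispatch_loop, daemon=True).start()
```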

We’ll fix these problems in this article. The third and last section of the series will extend the client we create here with Request/Response functionality for one-to-one communication. Other useful functionality that is not covered in this series, but that a production client should have, includes:

Continue reading Scaling messaging in Go network clients.

Categories: Technology

5 tips for driving design thinking in a large organization

O'Reilly Radar - Tue, 2017/11/14 - 05:00

How user-centered design focused on user needs and delivery can bring about real change and still be respected in the boardroom.

Continue reading 5 tips for driving design thinking in a large organization.

Categories: Technology

C++17 upgrades you should be using in your code

O'Reilly Radar - Tue, 2017/11/14 - 04:00

Structured bindings, new library types, and containers add efficiency and readability to your code.

C++17 is a major release, with over 100 new features or significant changes. In terms of big new features, there's nothing as significant as the rvalue references we saw in C++11, but there are a lot of improvements and additions, such as structured bindings and new container types. What’s more, a lot has been done to make C++ more consistent and remove unhelpful and unnecessary behavior, such as support for trigraphs and std::auto_ptr.

This article discusses two significant C++17 upgrades that developers need to adopt when writing their own C++ code. I’ll explore structured bindings, which is a useful new way to work with structured types, and then some of the new types and containers that have been added to the Standard Library.

Continue reading C++17 upgrades you should be using in your code.

Categories: Technology

Four short links: 14 November 2017

O'Reilly Radar - Tue, 2017/11/14 - 04:00

AI Microscope, Android Geriatrics, Doxing Research, and Anti-Goals

  1. AI-Powered Microscope Counts Malaria Parasites in Blood Samples (IEEE Spectrum) -- The EasyScan GO microscope under development would combine bright-field microscope technology with a laptop computer running deep learning software that can automatically identify parasites that cause malaria. Human lab workers would mostly focus on preparing the slides of blood samples to view under the microscope and verifying the results. Currently 20 minutes per slide (same as a human), but they want to cut it to 10.
  2. A Billion Outdated Android Devices in Use -- never ask why security researchers drink more than the rest of society.
  3. Datasette (Simon Willison) -- instantly create and publish an API for your SQLite databases.
  4. Fifteen Minutes of Unwanted Fame: Detecting and Characterizing Doxing -- This work analyzes over 1.7 million text files posted to pastebin.com, 4chan.org, and 8ch.net, sites frequently used to share doxes online, over a combined period of approximately 13 weeks. Notable findings in this work include that approximately 0.3% of shared files are doxes, that online social networking accounts mentioned in these dox files are more likely to close than typical accounts, that justice and revenge are the most often cited motivations for doxing, and that dox files target males more frequently than females.
  5. The Power of Anti-Goals (Andrew Wilkinson) -- instead of exhausting aspirations, focus on avoiding the things that deplete your life. (via Daniel Bachhuber)

Continue reading Four short links: 14 November 2017.

Categories: Technology

Four short links: 13 November 2017

O'Reilly Radar - Mon, 2017/11/13 - 05:30

Software 2.0, Watson Walkback, Robot Fish, and Smartphone Data

  1. Software 2.0 (Andrej Karpathy) -- A large number of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze, and visualize data that feeds neural networks. Supported by Pete Warden: I know this will all sound like more deep learning hype, and if I wasn’t in the position of seeing the process happening every day, I’d find it hard to swallow too, but this is real. Bill Gates is supposed to have said "Most people overestimate what they can do in one year and underestimate what they can do in 10 years," and this is how I feel about the replacement of traditional software with deep learning. There will be a long ramp-up as knowledge diffuses through the developer community, but in 10 years, I predict most software jobs won’t involve programming. As Andrej memorably puts it, “[deep learning] is better than you”!
  2. IBM Watson Not Even Close -- The interviews suggest that IBM, in its rush to bolster flagging revenue, unleashed a product without fully assessing the challenges of deploying it in hospitals globally. While it has emphatically marketed Watson for cancer care, IBM hasn’t published any scientific papers demonstrating how the technology affects physicians and patients. As a result, its flaws are getting exposed on the front lines of care by doctors and researchers who say that the system, while promising in some respects, remains undeveloped. AI has been drastically overhyped, and there will be more disappointments to come.
  3. Robot Spy Fish -- “The fish accepted the robot into their schools without any problem,” says Bonnet. “And the robot was also able to mimic the fish’s behavior, prompting them to change direction or swim from one room to another.”
  4. Politics Gets Personal: Effects of Political Partisanship and Advertising on Family Ties -- Using smartphone-tracking data and precinct-level voting, we show that politically divided families shortened Thanksgiving dinners by 20-30 minutes following the divisive 2016 election.[...] we estimate 27 million person-hours of cross-partisan Thanksgiving discourse were lost in 2016 to ad-fueled partisan effects. Smartphone data is useful data. (via Marginal Revolution)

Continue reading Four short links: 13 November 2017.

Categories: Technology

“Not hotdog” vs. mission-critical AI applications for the enterprise

O'Reilly Radar - Mon, 2017/11/13 - 05:00

Drawing parallels and distinctions around neural networks, data sets, and hardware.

Artificial intelligence has come a long way since the concept was introduced in the 1950s. Until recently, the technology had an aura of intrigue, and many believed its place was strictly inside research labs and science fiction novels. Today, however, the technology has become very approachable. The popular TV show Silicon Valley recently featured an app called “Not Hotdog,” based on cutting-edge machine learning frameworks, showcasing how easy it is to create a deep learning application.

Gartner has named applied AI and machine-learning-powered intelligent applications as the top strategic technology trend for 2017, and reports that by 2020, 20% of companies will dedicate resources to AI. CIOs are under serious pressure to commit resources to AI and machine learning. It is becoming easier to build an AI app like Not Hotdog for fun and experimentation, but what does it take to build a mission-critical AI application that a CIO can trust to help run a business? Let’s take a look.

For the purpose of this discussion, we will limit our focus to applications similar to Not Hotdog (i.e., applications based on image recognition and classification), although the concepts can be applied to a wide variety of deep learning applications. We will also limit the discussion to systems and frameworks, because personnel requirements can vary significantly based on the application. For example, for an application built for retinal image classification, Google required the assistance of 54 ophthalmologists, whereas an application for recognizing dangerous driving requires significantly less expertise and fewer people.

Image classification: Widely applicable deep learning use case

At its core, Not Hotdog is an image classification application. It classifies images into two categories: “hotdogs” and “not hotdogs.”

Figure 1. Screenshot from the “Not Hotdog” app courtesy of Ankur Desai.

Image classification has many applications across industries. In health care, it can be used for medical imaging and diagnosing diseases. In retail, it can be used to spot malicious activities in stores. In agriculture, it can be used to determine the health of crops. In consumer electronics, it can provide face recognition and autofocus to camera-enabled devices. In the public sector, it can be used to identify dangerous driving with traffic cameras. The list goes on. The fundamental difference between these applications and Not Hotdog is the core purpose of the application. Not Hotdog is intentionally meant to be farcical. As a result, it is an experimental app. However, the applications listed above are meant to be critical to core business processes. Let’s take a look at how “Not Hotdog” is built, and then we will discuss additional requirements for mission-critical deep learning applications.

Not Hotdog: How is it built?

This blog post takes us through the wonderful journey of Not Hotdog’s development process. The following is a summary of how it was built.

Not Hotdog uses the following key software components:

  • React Native: An app development framework that makes it easy to build mobile apps.
  • TensorFlow: An open source software library for machine learning. It makes building deep learning neural networks easy with pre-built libraries.
  • Keras: An open source neural network library written in Python. It is capable of running on top of TensorFlow and other machine learning libraries. It presents higher-level abstractions that make it easy to configure neural networks.

The following deep neural networks were considered:

  • Inception model: The Inception v3 model is a deep convolutional neural network pre-trained on the ImageNet data set of 14 million images. It gives us a pre-trained image classification model. The model needs “re-training” for specific tasks, but overall, it makes it really easy to build an image classification neural network. This was abandoned due to imbalance in the training data and the massive size of the model.
  • SqueezeNet: A compact deep neural network for image classification. Its primary benefits over the Inception model are that it requires 1/10th of the parameters, it is much faster to train, and the resulting model is much smaller and faster (it often requires 1/10th of the memory). SqueezeNet also has a pre-trained Keras model, making it easier to retrain for specific requirements. SqueezeNet was abandoned due to accuracy concerns.
  • MobileNets (the chosen one): Efficient convolutional neural networks for mobile vision applications. It helps create efficient image classification models for mobile and embedded vision applications. The Not Hotdog team ended up choosing this neural network after strongly considering the two above. (A retraining sketch follows this list.)
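As a rough illustration of what that choice looks like in code, the sketch below loads Keras’s pre-trained MobileNet and bolts a binary hotdog/not-hotdog head onto it. The layer choices and hyperparameters are illustrative, not the values the team actually used.

```python
# Sketch of retraining a pre-trained MobileNet for binary classification
# (hotdog / not hotdog); layer sizes and hyperparameters are illustrative.
from tensorflow import keras

base = keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the pre-trained features

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),  # hotdog vs. not hotdog
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=...) on your labeled photos
```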

The Not Hotdog team used an extremely economical hardware setup:

  • Laptop (yes, a single laptop!): 2015 MacBook Pro (used for training).
  • GPU: Nvidia GTX 980 Ti, attached to the laptop (used for training).
  • Users’ phones: For running the model. This is a simple example of edge computing, where inference happens right on the device, without relying on a server.

The team leveraged the following data set:

The training data set consisted of 150K images. Three thousand images were hotdogs, while 147K images were not hotdogs. During model training, these images were processed 240 times, which took 80 hours on the very modest hardware setup. The final model was trained only once. There was no feedback loop for retraining the model for incremental improvements.

Mission-critical AI: Drawing parallels with Not Hotdog

Based on the information above, let’s use a similar construct to understand the requirements for creating AI-enabled core business applications.

The Not Hotdog team selected React Native, which falls under the category of cross-platform native mobile application frameworks. If you are building a mobile app, it is a good idea to investigate this category. Native frameworks provide the flexibility of cross-platform frameworks while removing unnecessary components (i.e., the mobile browser) from the architecture. As a result, the applications run more smoothly and faster. A downside of native frameworks is a limited feature set. If you are building a mission-critical web application powered by AI, there are plenty of battle-tested frameworks, such as Django, Ruby on Rails, Angular, and React, for back-end and front-end development. One attribute to always check for is integration with your chosen deep learning libraries.

Another key software component selected by the Not Hotdog team was Keras running on top of TensorFlow. Keras, as mentioned above, is capable of running on top of TensorFlow, CNTK, or Theano. Keras focuses on enabling fast experimentation. One can think of Keras as a better UX for TensorFlow, enabling efficient development of models. Keras also provides the flexibility of choosing the desired deep learning library underneath. Bottom line: Keras provides a better developer experience. However, there are many other deep learning libraries available.

Choosing the right deep neural network for image classification

The Not Hotdog team ended up selecting MobileNets after significant experimentation with Inception model v3 and SqueezeNet. There are a number of deep neural networks available to choose from if you are building an AI-enabled application. More information can be found in the research paper by Alfredo Canziani, Eugenio Culurciello, and Adam Paszke, “An Analysis of Deep Neural Network Models for Practical Applications.” If you are planning to build an AI application that utilizes image classification, this paper is a must-read: it contains interesting figures that offer a high-level indication of the computing resources required by each neural network, the number of parameters used by each neural network, and the accuracy-per-parameter (a measure of efficiency) for each neural network.

Key considerations for choosing the deep neural network include: accuracy, number of parameters, processing times for the training set and input images, resulting model size, availability of the pre-trained neural network with the library of choice, and memory utilization. The optimal neural network will depend on your use case. It is a good idea to test a few different neural networks to see which one meets expectations.

Data sets: quantity and quality matter

There are many considerations for data sets to use as you build and deploy your AI application.

  • Training data: A model is only as good as the data it is trained on. For mission-critical AI applications, accuracy is much more important than it is for Not Hotdog; the implications of failing to recognize skin cancer are far more serious. As a result, we are going to require data sets that are significantly greater in quantity and quality. The Not Hotdog team trained their model using 150K images. It is likely that we will require much larger data sets to achieve the desired accuracy. The quality of the data is equally important, which means we may have to spend significant time preparing and labeling it.
  • Input data: The input data for prediction using our trained model can also vastly exceed the amount of data Not Hotdog has to process. Imagine a live camera feed from tens of cameras in a store. We may have to process thousands of images per minute to detect malicious activities, and a retail chain may have thousands of outlets. We are talking about around a million images per minute, orders of magnitude more than Not Hotdog, which may have to process a few images per minute on each user's mobile device.
  • Feedback loop: It is a great idea to provide a feedback loop to improve the model. In the retail store example, if we found false positive or false negative predictions, it would be a good idea to label them and feed them back into the model to retrain it. A continuous feedback loop means the training data size increases significantly over time, and the compute requirements are also significant, as we train the model multiple times using additional data.

Hardware setup for production-ready AI applications

The hardware used for running mission-critical applications will depend on SLAs and the massive differences in data set sizes:

  • Storage: It is highly unlikely our data set will fit on the disk of a laptop. We will require a number of SSDs and/or HDDs to store the ever-increasing training data.
  • Compute (training): It is extremely likely we will need a cluster of GPUs to train a model with a training data set consisting of millions of images.
  • Compute (running the model/application): Once a model is trained, depending on the size and performance of the model, the number of input images, and the business SLA expected of the application, we may need a cluster of GPUs to meet the application requirements.

Mission-critical AI: Key differentiators from Not Hotdog

The key differences between a mission-critical AI application and Not Hotdog are accuracy, SLAs, scale, and security:

  • Accuracy: Mission-critical AI applications will need significantly better accuracy. This means the training data needs to be greater in quality and quantity.
  • SLAs: Besides accuracy, an application used for core business processes will have strict SLAs. For example, when an image is fed to a model for prediction, the model should not take more than one second to predict the output. The architecture has to be scalable to accommodate the SLAs.
  • Scale: High accuracy expectations and strict SLAs mean we need an architecture that can effortlessly scale out, for both compute and storage.
  • Security: Certain applications present security and data privacy concerns. The data may include sensitive information and has to be stored securely with enterprise-grade security features to make sure only authorized users have access to the information.

Architectural considerations for mission-critical AI: Distributed computing and storage

To meet the accuracy and SLA requirements, consider an architecture that can easily scale out, for both compute and storage. Distributed big data platforms, summarized below, may present a viable architecture for scaling AI to meet business requirements.

Distributed storage: Big data stores built using a distributed file system, such as the MapR Converged Data Platform and Apache Hadoop, can provide a scalable and economical storage layer to store millions of images.

Distributed compute: Containerization is a viable architecture for enabling mission-critical AI using distributed computing. Packaging the models, metadata, and dependencies into Docker containers allows us to scale when needed by deploying additional containers.

Three-tier architecture for distributed deep learning: This blog post discusses a three-tier architecture for distributed deep learning. The bottom tier is the data layer, where data is stored. The middle tier is the orchestration layer, such as Kubernetes. The top tier is the application layer, where TensorFlow can be used as the deep learning tool. We can containerize the trained model into a Docker image, with metadata as image tags to keep the model version information, and all the dependencies/libraries included install-free in the container image. Figure 2 represents this example three-tier architecture for deep learning.

Figure 2. Reference architecture for distributed deep learning. Image courtesy of MapR, used with permission.

Summary

The Not Hotdog team generously allowed us to take a look at their application journey. We can draw parallels to the application and its components and extend the concepts to mission-critical AI applications. However, there are key differences we need to consider—namely, accuracy, scale, strict SLAs, and security. To meet these requirements, we need to consider a distributed deep learning architecture.

Continue reading “Not hotdog” vs. mission-critical AI applications for the enterprise.

Categories: Technology

Four short links: 10 November 2017

O'Reilly Radar - Fri, 2017/11/10 - 04:45

Syntactic Sugar, Surprise Camera, AI Models, and Git Recovery

  1. Ten Features From Modern Programming Languages -- interesting collection of different flavors of syntactic sugar.
  2. Access Both iPhone Cameras Any Time Your App is Running -- Once you grant an app access to your camera, it can: access both the front and the back camera; record you at any time the app is in the foreground; take pictures and videos without telling you; upload the pictures/videos it takes immediately; run real-time face recognition to detect facial features or expressions.
  3. Deep Learning Models with Demos -- portable and searchable compilation of pre-trained deep learning models. With demos and code. Pre-trained models are deep learning model weights that you can download and use without training. Note that computation is not done in the browser.
  4. Git flight rules -- Flight rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. [...]

Continue reading Four short links: 10 November 2017.

Categories: Technology

Building a natural language processing library for Apache Spark

O'Reilly Radar - Thu, 2017/11/09 - 08:40

The O’Reilly Data Show Podcast: David Talby on a new NLP library for Spark, and why model development starts after a model gets deployed to production.

When I first discovered and started using Apache Spark, a majority of the use cases I used it for involved unstructured text. The absence of libraries meant rolling my own NLP utilities, and, in many cases, implementing a machine learning library (this was before the deep learning era, and MLlib was much smaller). I’d always wondered why no one bothered to create an NLP library for Spark when many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago I mentioned the need for an NLP library within Spark to Talby; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception received by BigDL and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will be a standard tool among Spark users.

Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Continue reading Building a natural language processing library for Apache Spark.

Categories: Technology

Four short links: 9 November 2017

O'Reilly Radar - Thu, 2017/11/09 - 04:00

Culture, Identifying Bots, Attention Economy, and Machine Bias

  1. Culture is the Behaviour You Reward and Punish -- When all the “successful” people behave in the same way, culture is made.
  2. Identifying Viral Bots and Cyborgs in Social Media -- it is readily possible to identify social bots and cyborgs on both Twitter and Facebook using information entropy and then to find groups of successful bots using network analysis and community detection.
  3. An Economy Based on Attention is Easily Gamed (The Economist) -- Americans touch their smartphones on average more than 2,600 times a day (the heaviest users easily double that). The population of America farts about 3m times a minute. It likes things on Facebook about 4m times a minute.
  4. Frankenstein's Legacy: Four conversations about Artificial Intelligence, Machine Learning, and the Modern World (CMU) -- A machine isn’t a human. It’s not going to necessarily incorporate bias even from biased training data in the same way that a human would. Machine learning isn’t necessarily going to adopt—for lack of a better word—a clearly racist bias. It’s likely to have some kind of much more nuanced bias that is far more difficult to predict. It may, say, come up with very specific instances of people it doesn’t want to hire that may not even be related to human bias.

Continue reading Four short links: 9 November 2017.

Categories: Technology

The phone book is on fire

O'Reilly Radar - Wed, 2017/11/08 - 17:55

Lessons from the Dyn DNS DDoS.

Continue reading The phone book is on fire.

Categories: Technology

Guidelines for how to design for emotions

O'Reilly Radar - Wed, 2017/11/08 - 05:05

Learn what makes for a rich emotional experience and why, even if we make our technology invisible, the connection will still be emotional.

Continue reading Guidelines for how to design for emotions.

Categories: Technology

Identifying viral bots and cyborgs in social media

O'Reilly Radar - Wed, 2017/11/08 - 05:00

Analyzing tweets and posts around Trump, Russia, and the NFL using information entropy, network analysis, and community detection algorithms.

Particularly over the last several years, researchers across a spectrum of scientific disciplines have studied the dynamics of social media networks to understand how information propagates as the networks evolve. Social media platforms like Twitter and Facebook include not only actual human users but also bots, automated programs that can significantly alter how certain messages are spread. While some information-gathering bots are beneficial or at least benign, it was made clear by the 2016 U.S. Presidential election and the 2017 elections in France that bots and sock puppet accounts (that is, numerous social accounts controlled by a single person) were effective in influencing political messaging and propagating misinformation on Twitter and Facebook. It is thus crucial to identify and classify social bots to combat the spread of misinformation and especially the propaganda of enemy states and violent extremist groups. This article is a brief summary of my recent bot detection research. It describes the techniques I applied and the results of identifying battling groups of viral bots and cyborgs that seek to sway opinions online.

For this research, I have applied techniques from complexity theory, especially information entropy, as well as network graph analysis and community detection algorithms to identify clusters of viral bots and cyborgs (human users who use software to automate and amplify their social posts) that differ from typical human users on Twitter and Facebook. I briefly explain these approaches below, so deep prior knowledge of these areas is not necessary. In addition to commercial bots focused on promoting click traffic, I discovered competing armies of pro-Trump and anti-Trump political bots and cyborgs. During August 2017, I found that anti-Trump bots were more successful than pro-Trump bots in spreading their messages. In contrast, during the NFL protest debates in September 2017, anti-NFL (and pro-Trump) bots and cyborgs achieved greater successes and virality than pro-NFL bots.

Obtaining Twitter source data

The data sets for my Twitter bot detection research consisted of ~60M tweets that mentioned the terms “Trump,” “Russia,” “FBI,” or “Comey”; the tweets were collected via the free Twitter public API in separate periods between May 2017 and September 2017. I have made the source tweet IDs, as well as many of the analysis results files, available in a data project published at data.world. Researchers who wish to collaborate on this project at data.world should send a request email to datapartners@paragonscience.com.
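For readers who want to assemble a comparable sample, the sketch below shows one way to track those terms with the tweepy library; the credentials are placeholders, and the v1.1 streaming interface shown here reflects the era's public API, which Twitter has since changed.

```python
# One way to collect tweets mentioning the tracked terms via the public
# streaming API (tweepy shown for illustration; credentials are placeholders).
import tweepy

class TermCollector(tweepy.Stream):
    def on_status(self, status):
        # Persist the tweet ID and metadata for later analysis.
        print(status.id, status.created_at)

stream = TermCollector("CONSUMER_KEY", "CONSUMER_SECRET",
                       "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream.filter(track=["Trump", "Russia", "FBI", "Comey"])
```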

Detecting bots using information entropy

Information entropy is defined as “the average amount of information produced by a stochastic source of data.” As such, it is one effective way to quantify the amount of randomness within a data set. Because one can reasonably conjecture that actual humans are more complicated than automated programs, entropy can be a useful signal when one is attempting to identify bots, as has been done by a number of previous researchers. Of the recent research in social bot detection, particularly notable is the excellent work by groups of researchers from the University of California and Indiana University. Their “botornot” system uses a random forest machine learning model that incorporates 1,150 features derived from user account metadata, friend/follower data, network characteristics, temporal features, content and language features, and sentiment analysis.

For the current work, I elected to adopt a greatly simplified approach to social bot detection using two types of information entropy scores: one based on the distribution of time lags between successive posts, and a second based on the ordering of words within the posts. Accounts that post at uniform time intervals, or whose messages have unusually static or similar text content, might be bots or cyborgs.
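The article does not spell out the exact estimators, so the following is one plausible minimal version: Shannon entropy over binned gaps between consecutive posts, and over a user's word distribution (timestamps are assumed to be epoch seconds).

```python
# Minimal versions of the two entropy scores; the exact estimators used in
# the study are not specified, so these are plausible stand-ins.
import math
from collections import Counter

def shannon_entropy(items):
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def timing_entropy(timestamps, bin_seconds=10):
    # Entropy of binned gaps between consecutive posts; a bot posting on a
    # fixed schedule lands every gap in the same bin.
    gaps = [int((b - a) // bin_seconds)
            for a, b in zip(timestamps, timestamps[1:])]
    return shannon_entropy(gaps)

def text_entropy(posts):
    # Entropy over the user's word distribution; near-identical posts
    # concentrate the distribution.
    return shannon_entropy(" ".join(posts).lower().split())
```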

Next, I calculated the Z-scores of both the timing entropy and the text entropy. In the results presented here, I only analyzed users with at least 10 social posts, and I applied a conservative threshold of 2.5 for the Z-score (that is, raw scores at or above 2.5 standard deviations above the mean) of either entropy metric in order to flag possible bots. By lowering the threshold I would, of course, detect more bots, but at the risk of false positives that might inadvertently flag actual human users as bots. In the future, I hope to calculate the ROC curve for my dual-entropy approach to characterize the tradeoff between false positives and false negatives.
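A direct translation of that filtering into code might look like this, assuming per-user post counts and entropy scores have been precomputed (for example, by the functions sketched above):

```python
# Z-score the per-user entropies and apply the thresholds described above:
# at least 10 posts, and a score 2.5 standard deviations above the mean.
import statistics

def flag_possible_bots(entropy_by_user, posts_by_user,
                       min_posts=10, z_min=2.5):
    eligible = {u: s for u, s in entropy_by_user.items()
                if posts_by_user.get(u, 0) >= min_posts}
    mean = statistics.mean(eligible.values())
    stdev = statistics.stdev(eligible.values())
    z_scores = {u: (s - mean) / stdev for u, s in eligible.items()}
    return {u: z for u, z in z_scores.items() if z >= z_min}
```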

Measuring the virality of bots using the k-core decomposition

The k-core of a graph is a maximal subgraph in which each vertex has at least degree k. The coreness of a vertex is k if it belongs to the k-core but not to the (k+1)-core. The k-core decomposition is performed by recursively removing all the vertices (along with their respective edges) that have degrees less than k. Previous research has suggested that the k-core decomposition of a network can be very effective in identifying the individuals within a network who are best positioned to spread or share information. I used the k-core decomposition in 2016 to analyze more than 120M tweets related to the 2016 U.S. Presidential elections to identify the most influential users. For this bot detection research, I performed a k-core decomposition of the heterogeneous user/hashtag/URL Twitter networks for each day on which I collected samples between May and September 2017.
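In code, networkx implements exactly this recursive pruning; the tiny graph below stands in for the daily user/hashtag/URL network.

```python
# Coreness via networkx: core_number() performs the recursive pruning
# described above. The graph is a stand-in for a daily Twitter network.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("alice", "#tag"), ("bob", "#tag"), ("alice", "bob"),
                  ("carol", "alice"), ("lurker", "#tag")])

coreness = nx.core_number(G)   # {node: k}; carol and lurker get coreness 1
core2 = nx.k_core(G, k=2)      # subgraph of nodes with coreness >= 2
print(coreness)
```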

By combining our entropy scores with the corresponding coreness values, I was able to identify which bots or cyborgs (that is, humans who use specialized software to automate their social media posts) were most successful in prompting other users (some of which were also bots) to share or react to their posts, thus attaining positions closer to the center of the daily Twitter networks. (This k-core decomposition approach was used in a similar fashion by Bessi and Ferrara to measure the embeddedness of social bots.)

The 3-D scatter plot in Figure 1 shows clearly that the vast majority of the identified social bots are unsuccessful, remaining at the outer realms of the networks with low coreness values because no or few other users interact with them. Successful bots achieved higher coreness values because other users retweeted or replied to their posts. Normal human users (not shown) would be near the origin, and it is easy to discern that the higher the Z-score of either entropy metric, the less successful the bots become. This is most likely due to the fact that human users are readily able to recognize the bots’ abnormal postings and thus do not tend to share those bots’ posts. In summary, the more human-like the bot’s behavior, the more likely it is that actual users will share that bot’s posts.

Figure 1. 3-D scatter plot showing the majority of identified social bots are unsuccessful. Courtesy of Steve Kramer.

Unsuccessful bots

The most extreme value of the text entropy Z-score (outside the plot boundaries) is 143 (with a raw text entropy of 1.0) for the Twitter user @says_k_to_trump. A few sample tweets are shown below. Note that every tweet is the single letter “k” sent in reply to each of @realDonaldTrump’s tweets. That entropy Z-score reflects the fact that this user’s tweets’ contents are completely deterministic with no uncertainty. Understandably, no other user has interacted with @says_k_to_trump, so that bot has remained at the outermost edge of the network with a coreness of 1.

Figure 2. Screenshot courtesy of Steve Kramer.

The most extreme value of the timing entropy Z-score is 122.7 for the Twitter user @trade_debate. Note the very uniform timing pattern of that user’s tweets in Table 1. Starting with the second tweet, that user tweeted at a constant interval of two seconds.

Table 1: Most extreme timing entropy examples

Datetime            | Tweet text
2017-08-14 20:58:30 | RT @sdonnan: Donald Trump and the modern complexities of "Made in America". My @FT "Big Read" ahead of this week's #NAFTA talks. https://t.…
2017-08-14 20:59:04 | RT @FoxNews: China implements UN sanctions against North Korea, as Trump trade probe looms https://t.co/RD4KwQigzO
2017-08-14 20:59:06 | RT @FoxNews: Moments Ago: @POTUS signs measure that could result in severe trade penalties for China. https://t.co/OWIgslyi3f https://t.co/…
2017-08-14 20:59:08 | RT @CNNPolitics: President Trump signs a memorandum on Chinese trade practices https://t.co/stNgqVwENW
2017-08-14 20:59:10 | RT @MinhazMerchant: US set to launch investigation into Chinese theft of IPR as prelude to trade sanctions. Beijing put on notice https://t…
2017-08-14 20:59:12 | RT @Reuters: Chinese state newspaper says Trump trade probe will 'poison' relations https://t.co/XhwibAKD4H https://t.co/eQMD58yRYj
2017-08-14 20:59:14 | RT @Reuters: Chinese state newspaper says Trump's order to investigate Chinese trade practices will "poison" relations https://t.co/RzgYm1o…
2017-08-14 20:59:16 | RT @politico: The mayor of a small agricultural community in Iowa says Trump “fooled a lot of people” when he pulled out of TPP https://t.c…
2017-08-14 20:59:18 | RT @BreitbartNews: Out: RESIST In: Trump was right back when I campaigned against but you should let me do stuff for him https://t.co/40iSi…
2017-08-14 20:59:20 | RT @thehill: Trump tries to shifts focus from Charlottesville with tweets on trade, military, Dems: https://t.co/cuYRVJFuU5 https://t.co/MA…
2017-08-14 20:59:22 | RT @DrDenaGrayson: TRUE PRIORITIES‼️ @WH confirms #Trump himself insisted on starting speech w/trade & economy, NOT #racist attack‼️ htt…
2017-08-14 20:59:24 | RT @DrDenaGrayson: #Trump began his speech on trade deals & economy, then *2 days too late* he finally condemned #bigotry, hatred & violenc…
2017-08-14 20:59:26 | RT @nytimes: Trump suggested he'd take a lighter approach on trade issues with China if it does more to pressure North Korea https://t.co/O…
2017-08-14 20:59:28 | RT @CNN: Beijing says US threats to get tough on trade with China won't help solve the crisis over North Korea https://t.co/cBGRfWlRBV http…
2017-08-14 20:59:30 | RT @XHNews: #BREAKING: Trump signs executive memorandum on China despite worries about potential harms to trade ties with China https://t.c…
2017-08-14 20:59:32 | RT @christinawilkie: If you want to know who stands to benefit most from Trump's saber rattling on China trade & IP theft, check out his gu…
2017-08-14 20:59:34 | RT @christinawilkie: List of the defense contractors (and one kitchen counter maker) invited to White House today for Trump's event launchi…
2017-08-14 20:59:36 | RT @foxandfriends: President Trump to strike the first blow in U.S. trade war against China https://t.co/1T9MacNoMv

Successful bots

In contrast, one of the most successful bots is @Bhola021, which achieved a coreness value of 96 on 2017-08-12. Several sample tweets are shown below in Table 2. This is primarily a digital marketing bot rather than a political or propaganda bot. Note, in particular, the behavior of retweeting other user accounts with similar names and very similar tweet text.

Table 2: Tweets from a successful marketing bot

Datetime           | Tweet text
2017-08-12 2:49:36 | Donald Trump's 22-Year-Old Daughter Is The New Queen Of Instagram. https://t.co/PtzBUwujew
2017-08-12 2:50:13 | Anonymous Is Taking Down Donald Trump On April 1 And There Is A Way You Can Be Part Of It. https://t.co/td6AGeuk44
2017-08-12 2:56:15 | RT @bhola0957: Anonymous Is Taking Down Donald Trump On April 1 And There Is A Way You Can Be Part Of It. https://t.co/ipQrIsmo2r
2017-08-12 2:57:00 | RT @bhola0957: Donald Trump's 22-Year-Old Daughter Is The New Queen Of Instagram. https://t.co/XOg6YsZztA
2017-08-12 2:57:22 | RT @bhola5033: Anonymous Is Taking Down Donald Trump On April 1 And There Is A Way You Can Be Part Of It. https://t.co/AwtEXGHdbq
2017-08-12 2:57:35 | RT @bhola5033: Donald Trump's 22-Year-Old Daughter Is The New Queen Of Instagram. https://t.co/RgkwPrdIc6
2017-08-12 2:57:57 | RT @lovecommand102: Anonymous Is Taking Down Donald Trump On April 1 And There Is A Way You Can Be Part Of It. https://t.co/U2stRHl2dN
2017-08-12 2:59:01 | RT @lovecommand102: Donald Trump's 22-Year-Old Daughter Is The New Queen Of Instagram. https://t.co/AuEd85y7Wj
2017-08-12 2:59:28 | RT @lovecommand103: Anonymous Is Taking Down Donald Trump On April 1 And There Is A Way You Can Be Part Of It. https://t.co/ObFn5wgGXp
2017-08-12 2:59:46 | RT @lovecommand103: Donald Trump's 22-Year-Old Daughter Is The New Queen Of Instagram. https://t.co/KV9J0ZRSgM

With the approach described above, one can identify potential bots and measure their degree of success, or embeddedness, within the evolving social networks. As we will see next, these results can be enhanced significantly with community detection algorithms.

Identifying communities of viral bots and cyborgs

To understand more clearly how the most successful viral bots and cyborgs function within the Twitter network, I created a sub-network based on the tweets sent by those bots, extracting user mentions and URLs from replies and retweets. In this example, I generated a network using the 16,057 tweets sent by the top 20 bot accounts from August 7-19, 2017. The generated network consists of 73,569 links among 2,949 nodes. A k-core decomposition of this network resulted in a maximum coreness of 20. I then applied the Louvain community detection algorithm to identify the relevant groups within the center of the network for all nodes with coreness ≥ 10. In the Polinode interactive network displayed in Figure 3, each color represents a different community within the network. Among the top 20 bots, there is a highly interconnected network of bots with similar names (porantext, porantexts_, lovedemand101, lovecommand102, etc.) that retweet and share each other’s posts. These botnets are evidently commercial bots that attempt to drive click traffic to webpages with provocative titles such as “Donald Trump Kicked One Direction Out Of His Hotel And Here's Why” and “We Will Ruthlessly Ravage US troops, North Korea Warns Donald Trump On The Sun's Day” as the top two article titles.
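As a sketch of that pipeline, recent versions of networkx ship a Louvain implementation (the article used the same algorithm, though not necessarily this library); the placeholder graph below stands in for the network built from the top bots' tweets.

```python
# Louvain communities on the high-coreness core of a network. The graph is
# a placeholder for the retweet/mention/URL network built from the top 20
# bots' 16,057 tweets; the article kept nodes with coreness >= 10.
import networkx as nx
from networkx.algorithms.community import louvain_communities

G = nx.les_miserables_graph()       # placeholder graph
core = nx.k_core(G, k=5)            # keep only the well-embedded nodes
communities = louvain_communities(core, seed=42)
for i, members in enumerate(communities):
    print(f"community {i}: {len(members)} nodes")
```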

Figure 3. Network of top Trump viral bots and cyborgs in August 2017. Courtesy of Steve Kramer.

Because I am particularly interested in the effects of social bots in spreading information and swaying public opinion in politics, I filtered the source tweets to only those containing the word “Russia” in the tweet text. When I performed the k-core decomposition and entropy calculations on the Russia-related Twitter network, a different set of influential bots and cyborgs emerged for the period of August 7-19, 2017.

The Polinode network shown below in Figure 4 displays 17 different sub-groups in the network created by the top 20 Russia-related bots and cyborgs.

Figure 4. Network of top Russia-related viral bots and cyborgs in August 2017. Courtesy of Steve Kramer.

Community 1 is a pro-Trump group centered around the bot account named MyPlace4U (see Figure 5).

Figure 5. Community 1 (pro-Trump bots). Courtesy of Steve Kramer

In contrast, Community 10 is an anti-Trump group centered around the Twitter account named RealMuckmaker (see Figure 6), which was actually the most successful cyborg in this data set.

Figure 6. Community 10 (anti-Trump bots). Courtesy of Steve Kramer.

Table 3 below lists the top 20 viral bots and cyborgs in the Trump/Russia Twitter network for August 7-19, 2017. Note that only six of the top 20 viral bots and cyborgs act to support Donald Trump. Trump-supporting users are marked “Y” in the Pro-Trump? column. I chose each user’s sample tweet text by calculating the mean text similarity of each tweet to the rest of that user’s tweets and selecting the tweet with the highest mean similarity, using the Levenshtein distance via the fuzzywuzzy Python module.
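That selection step is easy to reproduce; a minimal version using fuzzywuzzy's Levenshtein-based ratio might look like this:

```python
# Pick each user's most representative tweet: the one with the highest
# mean fuzzy (Levenshtein-based) similarity to the user's other tweets.
from fuzzywuzzy import fuzz

def representative_tweet(tweets):
    def mean_similarity(i):
        others = [t for j, t in enumerate(tweets) if j != i]
        return sum(fuzz.ratio(tweets[i], t) for t in others) / max(len(others), 1)
    best = max(range(len(tweets)), key=mean_similarity)
    return tweets[best]

print(representative_tweet([
    "Trump trade probe looms", "Trump trade probe widens", "hello world"]))
```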

Table 3: Top 20 Russia-related Twitter bots and cyborgs in August 2017

Rank | Twitter user | Coreness | Pro-Trump? | Sample tweet text
1 | RealMuckmaker | 20 | N | RT @RealMuckmaker: Trump 'surprised' by Manafort raid in Russia probe @CNNPolitics https://t.co/CNdyvCzHMi
2 | LedJEFFlin | 18 | N | RT @LedJEFFlin: ZEMBLA - The dubious friends of Donald Trump: the Russians https://t.co/3aTpoHnNDK via @YouTube
3 | YourAnonCentral | 13 | N | RT @YourAnonCentral: @LouiseMensch @Plantflowes @MarcusC22973194 @PuestoLoco Russia is the broker of this conspiracy of tyranny, no less da…
4 | Dax_x98 | 12 | N | RT @Dax_x98: #Resistance #ImpeachTrump #TrumpLies #NotMyPresident #Resist #Trump #LockHimUp #FBR #TrumpRussia #TrumpSupporters #Republicans…
5 | ActionTime | 10 | N | #TrumpRUSSIA White House Uses N.Korea To Distract US from Mueller's Broadening Trump-Russia Probe.Trump's HUMILIATED by Fellow Dictator Kim
6 | natalikazadorn2 | 10 | N | RT @OlehTyukov: #Красоты #Россия #Russia https://t.co/OM3cPDCQgB
7 | newmirokliment1 | 10 | Y | @mfa_russia @RusEmbUSA @natomission_ru @RussianEmbassy @ambruspresse @RusConsulGen @amrusbel @RusBotWien https://t.co/yxdW7zG3LX
8 | OfficialNWM | 10 | N | RT @Im_TheAntiTrump: #TrumpRussia Cover lifted, a CIA spy offers his take on Trump & Russia & it's fascinating. https://t.co/55hptGq9Yp
9 | SoniaKatiMota | 10 | Y | Evidence - #Ukraine's Gov't Accusation of Russian Aggression VS The People of #Donbass. #DeepState #NATO #Russia https://t.co/KaC9p7M1n1
10 | Vancelvania | 10 | N | @IlyaBeraha @RusEmbUSA @Russia @StateDept @statedeptspox @EURPressOffice @mfa_russia @tassagency_en @SputnikInt… https://t.co/8ggCFwreRf
11 | Mario__Savio (suspended) | 9 | Y | #ICantBeTheOnlyPerson #FakeTerrorismExperts like Malcolm Nance named as "The Channel" for #Russia https://t.co/lH1YiY4ULI @BrianKarem #MAGA
12 | mr70 | 9 | N | RT @Joannetrueblue: New Trump-Russia emails could pose a 'devastating' legal entanglement for Paul Manafort #DemForce #TrumpRussia https:/…
13 | MyPlace4U | 9 | Y | RT @SalamMorcos: New Report: The DNC hack was actually a leak, and not a hack from Russia. https://t.co/PShpW58mSa https://t.co/Ax44v9OhC4
14 | 11worldpeace | 8 | N | Impeach Trump: Forget Russia. Is Provoking a Nuclear War with North Korea Grounds for Impeachment? https://t.co/bNamUYmseO via @democracynow
15 | Darnbunnies | 8 | N | @markets @ShoChandra Russia,Russia,Russia. We are not distracted. Comey/Flynn turned on you. Manafort is next. Muel… https://t.co/C5opmadwD4
16 | KDS_APEDAI | 8 | Y | @Hariborn @SatyajitHINDUS1 @ALOKVj78 @DrKinKam @veerendrakumarr @alokg2k @Russia @china @adgpi weak to support a war
17 | Lucyredrocks | 8 | N | RT @winterschild11: @RocqueinBTR @scooby_doo1 @Lucyredrocks @_Russia_HD_ @HeffronDrive @dbeltwrites @ktothe5th @YUMAPIG1 @kevingschmidt @Mi…
18 | perfectsliders | 8 | Y | DNC Hack Was ‘Inside Job,’ Not by Russia <-- @PamelaGeller
19 | scooby_doo1 | 8 | N | @RocqueinBTR @Lucyredrocks @winterschild11 @_Russia_HD_ @HeffronDrive @dbeltwrites @ktothe5th @YUMAPIG1… https://t.co/RR30icdRXc
20 | 92a312 | 7 | Y | RT @SoniaKatiMota: Excellent! 2014 #Ukraine Crisis - What You're Not Being Told. #NATO, #DeepSate #Russia https://t.co/sgQHUE3YW3

Tracking the battles among groups of Russia-related bots and cyborgs

To discern how successful the different groups of Russia-related bots and cyborgs were in spreading their messages on Twitter, I calculated the daily mean and maximum coreness values attained by the six pro-Trump users in Table 3 versus the remaining 14 anti-Trump (or neutral) users in Table 3. Figure 7 (interactive version here) shows that, overall, the anti-Trump group was more successful in spreading its messages during the period of August 7-19, 2017, with the greatest peak on August 11 led by @RealMuckmaker, which promoted a link to a particular CNN Politics article regarding the FBI’s raid on the home of former Trump campaign manager Paul Manafort.
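The aggregation behind Figure 7 is a simple group-by; the sketch below assumes a DataFrame with one row per (date, user), carrying that user's group label and daily coreness (the rows shown are invented):

```python
# Per-day, per-group mean and max coreness, as plotted in Figure 7.
# The rows are invented stand-ins for the real daily measurements.
import pandas as pd

df = pd.DataFrame({
    "date":     ["2017-08-11", "2017-08-11", "2017-08-12"],
    "group":    ["anti-Trump", "pro-Trump", "anti-Trump"],
    "coreness": [20, 9, 14],
})
daily = df.groupby(["date", "group"])["coreness"].agg(["mean", "max"])
print(daily)
```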

Figure 7. Maximum coreness values of groups of Russia-related Twitter bots/cyborgs. Courtesy of Steve Kramer.

Discovering prominent bots and cyborgs in the NFL protests controversy

I applied the same entropy-based bot detection and network analysis approach to over 1M tweets that included the terms “Trump” and “NFL” from September 14-25, 2017. The Polinode network shown below in Figure 8 displays 16 different sub-groups in the network created by the top 20 NFL-related bots and cyborgs. Nine of the groups are opposed to the NFL protests while seven are in favor of the NFL players who took a knee in protest.

Figure 8. Network of top Trump/NFL-related viral bots and cyborgs in September 2017. Courtesy of Steve Kramer.

As in the Russia-related example, I calculated the maximum daily coreness value for the pro-NFL and anti-NFL groups within the top 20 viral NFL-related bots. Figure 9 shows that the anti-NFL (and pro-Trump) bots and cyborgs were more successful in spreading their social content than the pro-NFL group. Refer to my data.world data project for further details.

Figure 9. Maximum coreness values of groups of NFL-related Twitter bots/cyborgs. Courtesy of Steve Kramer.

Uncovering Facebook bots and cyborgs during and after the 2016 U.S. presidential elections

Given the increasing number of reports of Russian involvement in last year’s elections across multiple social platforms, I wanted to apply the entropy-based bot detection method to election-related Facebook data. Our friend and research colleague Jonathon Morgan, the CEO of New Knowledge and co-founder of Data for Democracy, kindly provided a data set of 10.5M public Facebook comments from Donald Trump’s Facebook page collected between July 2016 and April 2017.

Unfortunately, because I have only the text content and timestamps of the users’ Facebook comments, I do not have the full social network structure that was available in the previous Twitter examples. Consequently, it is not possible to perform the same type of k-core decomposition. As an alternative measure of a comment’s reach, I examined “likes,” but found that the number of likes is not a particularly strong or reliable predictor of a bot’s or cyborg’s success. The 20 Facebook users with the most extreme Z-scores of text entropy are listed in Table 4 below. The top user, Nadya Noor, had a text entropy score more than 253 standard deviations above the mean score of the remaining users.
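As a rough illustration of the entropy-plus-Z-score scoring, the sketch below computes plain Shannon entropy over each user's comment texts and inter-comment time gaps, then standardizes the scores across the population. The exact entropy features are the ones defined earlier in this article; the simplified definitions and function names here are my own stand-ins.

```python
# Simplified sketch: per-user text/timing entropy, standardized as Z-scores.
import math
from collections import Counter
from statistics import mean, stdev

def shannon_entropy(items):
    """Shannon entropy (bits) of the empirical distribution over `items`."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def entropy_z_scores(comments_by_user):
    """comments_by_user: {user: [(unix_time, text), ...]}, sorted by time.
    Returns {user: (z_text, z_timing)}; extreme values flag likely bots."""
    raw = {}
    for user, rows in comments_by_user.items():
        texts = [text for _, text in rows]
        gaps = [int(b - a) for (a, _), (b, _) in zip(rows, rows[1:])]
        raw[user] = (shannon_entropy(texts), shannon_entropy(gaps or [0]))
    t_mean, t_sd = mean(v[0] for v in raw.values()), stdev(v[0] for v in raw.values())
    g_mean, g_sd = mean(v[1] for v in raw.values()), stdev(v[1] for v in raw.values())
    return {u: ((t - t_mean) / t_sd, (g - g_mean) / g_sd)
            for u, (t, g) in raw.items()}
```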

Table 4: Top 20 most extreme text bots and cyborgs from Trump Facebook comments

Facebook user | Text entropy score | Timing entropy score | Z-score text | Z-score timing | # of posts | Avg # of likes
Nadya Noor | 6630.770 | 0.048 | 253.320 | -0.465 | 39 | 0
Gold AL | 556.234 | 0.024 | 21.244 | -1.484 | 217 | 0.088
Hanadi Kasem Agha | 433.089 | 0.039 | 16.539 | -0.873 | 35 | 1.486
Hafed Ali | 128.920 | 0.045 | 4.919 | -0.597 | 27 | 0
Gol Pamchal | 105.875 | 0.078 | 4.038 | 0.757 | 13 | 0
David Haugen | 105.467 | 0.019 | 4.023 | -1.700 | 183 | 0.230
Fred Bagnall | 99.769 | 0.019 | 3.805 | -1.697 | 178 | 0.320
Lev Koshkin | 91.340 | 0.049 | 3.483 | -0.447 | 24 | 0
Ahmed Hamdi | 90.650 | 0.039 | 3.457 | -0.873 | 36 | 0
Yousry Girgis | 85.234 | 0.050 | 3.250 | -0.388 | 23 | 0
Alao Ahmad | 84.678 | 0.091 | 3.228 | 1.289 | 11 | 0.182
Elizabeth Dominguez | 23.688 | 0.169 | 0.898 | 4.542 | 121 | 0.066
Johnathan Morissette | 19.068 | 0.327 | 0.722 | 11.114 | 63 | 0
Omid Omidi | 19.068 | 0.181 | 0.722 | 5.035 | 15 | 0
Ricky Sujanani | 12.192 | 0.138 | 0.459 | 3.271 | 20 | 0
Rizgar Kh Jacob | 11.757 | 0.188 | 0.443 | 5.345 | 217 | 0.018
Robin Van Doorn | 8.790 | 0.126 | 0.329 | 2.741 | 16 | 0
Ana Ferreira | 5.840 | 0.123 | 0.216 | 2.654 | 11 | 0.182
Jose Antonio Guadarrama | 5.460 | 1.611 | 0.202 | 64.530 | 358 | 0
David Quinlan | 5.177 | 0.123 | 0.191 | 2.654 | 11 | 0.091

The most extreme user based on text entropy, Nadya Noor, posted very similar texts in Arabic during late January and early February 2017 (see Table 5).

Table 5: Sample Facebook comments from most extreme text bot (Nadya Noor)

Comment | Datetime
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:13:05+0000
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:13:51+0000
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:14:02+0000
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:14:16+0000
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:14:36+0000
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:16:31+0000
ياالله العن أمريكا على مافعلته في العراق والعراقيين منذ٢٠٠٣والى الان ياالله ياالله ياالله | 2017-01-28T01:16:45+0000
الله يلعن أمريكا الله يلعن بوش الله يلعن بلير وان شاءالله يبعث لكم خسفا اونارا بسبب مافعلتموه بالعراق الله يحرق أمريكا الله يحرق أمريكا | 2017-02-02T07:13:33+0000
الله يلعن أمريكا على مافعلته بالعراق وشعب العراق كنا شعبامتالف متحاب متسامح لانعرف الطائفيه والأمن والامان في شوارعنا وبيوتنا ومحافظاتنا الله يلعنك أمريكا ان شاءالله الى الجحيم انت وشعبك الغدار | 2017-02-02T16:10:07+0000
الله يلعن أمريكا على مافعلته بالعراق وشعب العراق كنا شعبامتالف متحاب متسامح لانعرف الطائفيه والأمن والامان في شوارعنا وبيوتنا ومحافظاتنا الله يلعنك أمريكا ان شاءالله الى الجحيم انت وشعبك الغدار | 2017-02-02T19:28:25+0000
الله يلعن أمريكا على مافعلته بالعراق وشعب العراق كنا شعبامتالف متحاب متسامح لانعرف الطائفيه والأمن والامان في شوارعنا وبيوتنا ومحافظاتنا الله يلعنك أمريكا ان شاءالله الى الجحيم انت وشعبك الغدار | 2017-02-02T19:28:54+0000
الله يلعن أمريكا على مافعلته بالعراق وشعب العراق كنا شعبامتالف متحاب متسامح لانعرف الطائفيه والأمن والامان في شوارعنا وبيوتنا ومحافظاتنا الله يلعنك أمريكا ان شاءالله الى الجحيم انت وشعبك الغدار | 2017-02-02T19:29:14+0000

The comments fall into three distinct texts (kept verbatim in Arabic above, as posted). Approximate English translations: the first, repeated seven times within minutes, reads, "O God, curse America for what it has done to Iraq and the Iraqis from 2003 until now. O God, O God, O God." The second reads, "God curse America, God curse Bush, God curse Blair, and God willing may He send ruin or fire down on you for what you did to Iraq. God burn America, God burn America." The third, repeated four times, reads, "God curse America for what it did to Iraq and the people of Iraq. We were a united, loving, tolerant people who knew no sectarianism, with security and safety in our streets, homes, and provinces. God curse you, America; God willing, to hell with you and your treacherous people."

Figure 10 shows a Google translation of one of that user’s typical, strongly anti-American comments.

Figure 10. Google translation of sample comment by Nadya Noor. Screenshot courtesy of Steve Kramer.

In the future, I plan to apply community detection algorithms to the text content and embedded URLs in these Facebook bots’ posts to determine their primary discussion topics and political leanings.
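One possible starting point for that analysis (my sketch, not a method the article commits to) is to connect accounts that embed the same URLs and then run a modularity-based community detection over the resulting graph; the function name `url_communities` and its input format are hypothetical.

```python
# Sketch: group accounts into communities by the URLs they share.
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def url_communities(urls_by_user):
    """urls_by_user: {user: set of URLs embedded in that user's posts}.
    Returns a list of communities (frozensets of users)."""
    g = nx.Graph()
    for (u1, urls1), (u2, urls2) in combinations(urls_by_user.items(), 2):
        shared = len(urls1 & urls2)
        if shared:                                # weight = # of shared URLs
            g.add_edge(u1, u2, weight=shared)
    return list(greedy_modularity_communities(g, weight="weight"))
```

Inspecting the dominant URLs within each detected community would then suggest its primary discussion topics and political leaning.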

Conclusions

In this article, I have demonstrated how social bots and cyborgs can be readily identified on both Twitter and Facebook using information entropy, and how groups of successful bots can be found using network analysis and community detection. Given the extreme risks of disinformation and propaganda spread through social media, my hope is that this approach, along with the work of other researchers, will enable greater transparency and help protect democracy and the authenticity of online discourse. I invite researchers who wish to collaborate on studies of these data sets to request access to our data project hosted on data.world.

Continue reading Identifying viral bots and cyborgs in social media.

Categories: Technology

Consumer-driven innovation for continuous glucose monitoring in diabetes patients

O'Reilly Radar - Wed, 2017/11/08 - 05:00

CGMs are unique in the way consumers have taken it upon themselves to create modifications to medical devices.

Imagine if your life suddenly depended on monitoring your body’s reaction every time you had a snack, skipped a meal, or ate a piece of candy. This is a reality for approximately 1.25 million people in the USA who have been diagnosed with Type 1 Diabetes (T1D).

People with T1D experience unhealthy fluctuations in blood glucose levels due to the destruction of beta cells in the pancreas by the person’s own immune system. Beta cells produce insulin, which is a hormone that allows your body to break down, use, or store glucose, while maintaining a healthy blood sugar level throughout the day. Presently, there is no cure for T1D, so patients must be constantly vigilant about maintaining their blood glucose levels within a healthy range in order to avoid potentially deadly consequences.

Currently, continuous glucose monitors (CGMs) are the most effective way to manage T1D. However, consumers have already become frustrated with the limitations of commercially available CGMs and are developing at-home modifications to overcome them. This, in turn, is influencing the direction of research and development in the biomedical devices industry, as multiple companies compete to create a CGM that appeals to the largest consumer population.

Thus, consumer-driven innovation in CGM data access, CGM-insulin pump integration, and glucose sensor lifespan has led to rapid growth in the field of diabetes management devices.

Coping with the highs and lows

Patients with T1D need to monitor their blood glucose levels to ensure they don’t become hyperglycemic (high blood glucose levels) or hypoglycemic (low blood glucose levels), both of which can cause life-threatening complications.

Throughout the late 1980s and 1990s, home blood glucose monitoring devices were the most accurate way to measure blood glucose levels. These devices use a lancet to prick the person’s finger to obtain real-time glucose levels from a drop of blood.

Although still used today by some diabetics as a primary means of T1D management, finger prick devices have considerable drawbacks. These include the physical pain of frequent finger pricks, the static nature of each glucose reading, and the inconvenience and conspicuousness of taking multiple readings throughout the day and night.

It is no wonder then that the market potential for a device that conveniently and accurately measures blood glucose levels continues to soar.

The continuous glucose monitor (CGM)

At the turn of the 21st century, the integration of technology and medicine introduced a novel way for patients to gain control of T1D. In 1999, MiniMed obtained approval from the U.S. Food and Drug Administration (FDA) for the first continuous glucose monitor (CGM). The device was implanted by a physician and recorded the patient’s glucose levels for three days. The patient then returned to the clinic to have the sensor removed and discuss any trends revealed by the CGM.

In 2001, MiniMed was acquired by Medtronic, a medical device company that specializes in diabetes management devices. In 2003, Medtronic received FDA approval to launch the first real-time, patient-use CGM device. This kick-started an ongoing competition among diabetes device companies, such as Dexcom Inc., Senseonics, and Abbott Laboratories, to create more accurate, user-friendly CGMs.

Today, CGM devices consist of a thin, wire-like sensor inserted under the skin that takes blood glucose readings every five minutes. These readings are sent wirelessly to an external transmitter and can be checked by the patient at the push of a button.

While most CGMs still need to be calibrated by using a finger prick device, they offer many advantages to blood glucose monitoring and diabetes management. CGMs not only give patients easy access to real-time blood glucose readings, but also track the readings over time so the patient can determine how quickly their blood glucose levels are increasing or decreasing. The CGM wearer can also customize their device with the healthy blood glucose range recommended by their endocrinologist. If the patient’s blood glucose levels rise above or fall below this range, the transmitter will alert the wearer to take action. “The CGM’s warning is a bit like the lane departure system in cars, but for your blood sugar,” says Dr. John Welsh, a medical writer at Dexcom, Inc., and a CGM user himself.

Consumer-driven innovation and competition

While medical devices are continuously being improved upon with the progression of technology and scientific knowledge, CGMs have recently been undergoing a rare, if not unprecedented, bottom-up revolution.

Increasingly, tech-savvy consumers have realized the limitations of CGMs and are making at-home improvements to meet their needs, rather than waiting for new technology to become commercially available. This, in turn, is directing the future trajectory of research and development within companies as well as providing a rich environment for competition.

For example, John Costik, an engineer whose 4-year-old son was diagnosed with T1D in 2012, designed a simple phone app that let him access the data from his son’s Dexcom transmitter online, so he could monitor the boy’s glucose levels during day care. He shared his innovation on Twitter and, unsurprisingly, discovered a large community of parents who were equally frustrated by the lack of access to their children’s blood glucose data.

Perhaps taking a cue from this, in 2015, Dexcom launched a “Share” feature for their devices, which allows CGM wearers to share their glucose information with up to five followers. In addition to family and friends, health care professionals can also follow their patients to assess how well the CGM wearer is managing their diabetes.

The DIY artificial pancreas

To relieve the mental burden of managing T1D, there has been a growing interest in “closed loop” artificial pancreas systems, which integrate CGMs with insulin pumps to automatically regulate insulin release. This would be especially useful at night, mitigating the need for patients to wake up and physically adjust their insulin pump if they were becoming hypoglycemic.

In 2013, Dana Lewis, a patient with T1D, and Scott Leibrand, a computer-networking engineer, created the Do-It-Yourself Pancreas System and founded the Open Source Artificial Pancreas System (OpenAPS). Rather than wait for the slow FDA approval of a new medical device, the OpenAPS community hopes to make artificial pancreas technology quickly available by engaging willing patients with T1D as innovators, independent researchers, and clinical trial subjects.

In direct response to this consumer-led movement, many major T1D medical device companies, including Dexcom, have a CGM-insulin pump integrated system in their pipelines. In 2016, Medtronic became the first company to receive approval for an artificial pancreas system. Their MiniMed 670G insulin pump communicates with the CGM to reduce insulin delivery at a predetermined hypoglycemic threshold.
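To make the "predetermined hypoglycemic threshold" idea concrete, here is a toy sketch of threshold-based basal adjustment. It is not the MiniMed 670G's or OpenAPS's actual algorithm; every threshold and rate below is invented for illustration.

```python
# Toy illustration only: scale back basal insulin as CGM glucose drops
# toward a hypoglycemic threshold. All numbers are hypothetical examples.
def basal_rate(cgm_mg_dl: float,
               normal_rate: float = 1.0,     # units/hour, hypothetical
               suspend_below: float = 70.0,  # mg/dL, hypothetical threshold
               reduce_below: float = 90.0) -> float:
    """Return the basal insulin rate (units/hour) for a CGM reading."""
    if cgm_mg_dl < suspend_below:   # at/near hypoglycemia: suspend delivery
        return 0.0
    if cgm_mg_dl < reduce_below:    # trending low: halve delivery
        return normal_rate / 2
    return normal_rate              # otherwise, deliver the normal basal
```

A real closed-loop controller also accounts for glucose trends, insulin already on board, and sensor error, which is one reason regulatory review of these systems is so demanding.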

While the system still relies on finger prick calibration, such devices promise to be effective in T1D management and to relieve a significant amount of anxiety for patients and their families. Increased patient compliance should also reduce the risk of developing comorbidities caused by fluctuating blood glucose levels.

Meanwhile, the FDA keeps a close eye on amateur modifications to medical devices, which must go through rigorous trials to meet the FDA’s standards for safe and effective treatment of chronic conditions in humans. Since homemade modifications have not undergone these tests, the FDA prohibits the sale of these consumer-modified devices. However, the FDA is working with medical device companies to develop new, improved medical devices for diabetes and get them to market as quickly as possible without compromising consumer health and safety.

Extending the lifespan of the CGM sensor

CGM wearers have also noted the short lifespan of the sensors. Having to replace the sensor every 7-14 days is both inconvenient and costly. Because of this, some patients have confessed to wearing a sensor beyond its expiration date. This can be dangerous, as the sensor’s performance may degrade beyond its intended period of use, affecting the accuracy of CGM readings.

Senseonics has seized this market opportunity and developed Eversense, an implantable sensor that lasts up to 90 days and would completely replace finger prick tests. Unlike Dexcom’s and Medtronic’s at-home sensor replacement systems, the Eversense sensor must be implanted via a minor surgical procedure at a doctor’s office. The Eversense transmitter also vibrates directly on the skin if the patient’s blood glucose level falls outside the healthy range, eliminating the need to carry a mobile device to receive alerts.

Eversense is currently under evaluation by the FDA for premarket approval. However, Eversense has already gained the CE Mark approval, allowing the company to start marketing the device in Europe. In 2016, Senseonics partnered with Roche, giving the big pharmaceutical company exclusive rights to market Eversense abroad. In addition, Roche and the venture firm New Enterprise Associates are the top two shareholders in Senseonics, providing major financial and marketing support if Eversense is approved in the U.S.

Tackling the cost of CGMs

Another large obstacle for consumers is the cost of CGMs. CGMs are more expensive than finger prick monitors, and the price varies depending on the CGM’s components and the company that produces them. Sensors typically cost between $35 and $100, whereas transmitters can cost upwards of $1,000. Furthermore, the out-of-pocket cost varies greatly depending on an individual’s insurance provider. As of March 2017, Dexcom’s G5 Mobile CGM is the only FDA-approved system that meets all of the criteria to be covered by Medicare.

“Retrospective” or “professional” CGMs provide a way to alleviate this financial burden. Patients with a retrospective CGM only need to purchase the sensor component, which can often be covered by insurance, while a professional clinic buys the corresponding transmitter. This approach allows the same transmitter to be used to scan the sensors of hundreds of patients, saving money for both insurance companies and consumers.

Abbott Laboratories has created a retrospective CGM called the Freestyle Libre Pro, which was approved by the FDA in 2016. A doctor inserts a transcutaneous glucose sensor, which records readings over a period of up to two weeks. The patient then returns to the clinic, where the sensor is removed and scanned to obtain the stored glucose readings. These readings give health care professionals a comprehensive view of a patient’s blood sugar levels over time and of how well the patient is managing their diabetes.

Abbott has also worked to alleviate the physical toll of T1D management by “factory calibrating” its devices, eliminating the need for painful finger pricks. In September 2017, Abbott’s Freestyle Libre Flash became the first FDA-approved, real-time CGM that does not require finger prick calibration. Many consumers, especially those who still rely on finger prick tests as their main method of diabetes management, are ready to embrace this new technology.

Importantly, neither the Libre Pro nor the Libre Flash has the capability to alert the wearer if their blood glucose readings fall outside of the normal range. However, the affordability of Abbott’s devices appeals to an important market niche. Additionally, Abbott focuses heavily on marketing their CGMs toward patients with Type II Diabetes (T2D) who may currently rely solely on clinical management or finger prick tests to manage their diabetes.

Creating the ultimate CGM

As established companies work to improve and refine their CGM systems, new competitors are still emerging in the field.

GlySens, a privately owned biomedical technology company in San Diego, California, is developing a CGM system called Eclipse ICGM, designed to overcome multiple limitations of current CGMs in a single device.

The ICGM system is designed to last up to one year, with no components that need to be changed or charged every few days, which will mean fewer visits to the doctor. It will also be completely subcutaneous, making it very discreet, and it promises to do away with finger prick glucose monitoring entirely.

Currently, GlySens is conducting a clinical trial to assess the safety and tolerance of their long-term implanted glucose-monitoring sensor in humans.

Type II diabetes provides a growing market potential for CGMs

Although current CGMs are primarily targeted toward people with T1D, these patients only make up about 5% of the diabetic population in the USA. The remaining 95% of diabetics (29 million people) have T2D, which occurs when a person’s cells are unable to effectively use insulin.

Moreover, approximately one-third of American adults (86 million) are living with pre-diabetes. T2D is expected to remain a major health concern in the future, with some projections predicting prevalence will rise by more than 50% by 2030.

Unlike T1D, T2D is usually diagnosed well into adulthood and a physician’s recommendation to start glucose monitoring can represent a sudden and challenging lifestyle change. Therefore, convenience and cost of CGMs will be a huge factor in whether companies can successfully market these devices to the T2D population.

CGMs would allow patients to share their glucose level readings with doctors to analyze data trends so they can develop a T2D management plan. Interestingly, research from Dr. Jeremy Pettus’s lab at the University of California, San Diego, suggests that the ability to share CGM data with friends, family, and health care professionals may provide motivation for T2D patients to maintain a healthier lifestyle.

Thus, a successful marketing strategy toward T2D patients could substantially increase the consumer base for any company involved in diabetes management.

A bright future for CGMs

Before the advent of CGMs, managing T1D was very stressful for patients and their loved ones. CGMs alleviate some of this burden by providing data trends in addition to static blood glucose readings, which allows patients with T1D to manage their diabetes more effectively.

CGMs are a prime example of how technology and science can be integrated to improve the lives of patients with chronic conditions like T1D. CGMs are also unique in the way consumers have taken it upon themselves to create modifications to these medical devices, a space that was previously reserved for research and development branches of biotechnology companies.

With multiple competitors vying to meet the needs of an increasingly technologically literate consumer base, and with an ever-expanding market, it will be exciting to monitor the future development of CGMs.

Continue reading Consumer-driven innovation for continuous glucose monitoring in diabetes patients.

Categories: Technology
